A pseudoproxy assessment of why climate ﬁeld reconstruction methods perform the way they do in time and space

. Spatiotemporal paleoclimate reconstructions that seek to estimate climate conditions over the last several millennia are derived from multiple climate proxy records (e.g. tree rings, ice cores, corals, and cave formations) that are heterogeneously distributed across land and marine environments. Assessing the skill of the methods used for these reconstructions is critical as a means of understanding the spatiotemporal uncertainties in the derived reconstruction products. Traditional statistical measures of skill have been applied in past applications, but they often lack formal null hypotheses that incorporate the spatiotemporal 5 characteristics of the ﬁelds and allow for formal signiﬁcance testing. More recent attempts have developed assessment metrics to evaluate the difference of the characteristics between two spatiotemporal ﬁelds. We apply these assessment metrics herein to results from synthetic reconstruction experiments based on multiple climate model simulations to assess the skill of four reconstruction methods. We further interpret the comparisons using analysis of Empirical Orthogonal Functions that represent the noise-ﬁltered climate ﬁeld. The features of climate models and reconstruction methods identiﬁed in this paper demon- 10 strate more detailed assessments of reconstruction methods and point to important areas of testing and improving real-world reconstruction methods. Our results show that the CFRs derived within the CCSM, GISS, and MPI PPEs are skillful at recovering the mean structure, whereas the CFRs associated with the BCC and IPSL PPEs exhibit large biases that are consistent with those presented in et These results are likely due in part to the fact that the EOFs of the CCSM and MPI models are stable across the calibration and reconstruction periods. Additionally, the sampling network well represents the global temperature of GISS and MPI whereas is inadequate for the BCC model. This plays a key role in weakening the ability of CCA and RIDGE to reconstruct the mean of the BCC model. In terms of skill recovering the spatial covariance associated with teleconnections, the TTLS and TTLH methods outperform CCA and RIDGE, and in general CFRs derived in the CCSM PPEs outperform the CFRs associated with PPEs based on the other climate model simulations. Moreover, the CFRs of BCC and MPI show no skill in recovering the large-scale teleconnection patterns when the SNR is low. For the BCC model, this low skill is also corroborated by the observation that CFRs in the BCC PPE fail to represent the variability of teleconnections in the leading EOFs of the target model in the ENSO dominant region. Within the MPI PPEs, similar challenges reconstructing the spatial covariance are likely because the teleconnection in the model simulation is already weak, as indicted by the model’s low correlation between the leading ﬁve EOFs of the Niño3 region and those of the ENSO dominant region.

spatiotemporal differences among methods are simply due to random error. An additional challenge of previous statistical assessments is that they interpret the derived CFRs as complete spatiotemporal representations of the targeted climate field, despite the fact that most CFR methods target reduced-space versions of a field by selecting, for instance, only a few leading patterns from matrix decompositions of the field's covariance matrix. Despite such reductions being the basis of almost all CFR approaches, it is rare that skill assessments decompose reconstruction performance in terms of leading reconstructed and 70 targeted spatiotemporal patterns.
In an attempt to more rigorously compare spatiotemporal characteristics of reconstructed and targeted climate fields in PPEs, Li and Smerdon (2012) formalized a null hypothesis for these comparisons. Their approach was expanded by Li et al. (2016) who applied methods for comparing the mean and covariance structure between two spatiotemporal random fields developed by Zhang and Shao (2015). This method has significant advantages: it evaluates whether the spatially-varying mean 75 and covariance structures of two climate fields exhibit similar patterns, it is completely non-paramteric and thus free of the risk of model misspecification, and it is constructed to separate skill within a given EOF basis and thus allows assessments of skill within each leading pattern of spatiotemporal variability.
We use the formalism of Li et al. (2016) and the established PPE framework from Smerdon et al. (2016) to evaluate whether there are statistically robust differences between derived CFRs and targeted climate fields. As introduced in earlier and pre-80 liminary work at a conference (Lyubchich et al., 2017), we demonstrate how CFR skill can be separated into leading modes of variability, which allows us to better interpret the performance of each CFR in terms of the particular spatiotemporal characteristics of the climate model simulations on which each PPE is based. This approach allows us to more clearly articulate the reasons why the applied CFR methods perform differently within model-specific PPEs and across PPEs based on different last-millennium simulations. Our results demonstrate how our methods can be used to improve interpretations of uncertainties 85 and limitations in state-of-the-art CFR methods and provide improved understanding of how specific characteristics of the real climate system may give rise to enhanced or reduced CFR performance.

Data
The adopted experimental setup is specifically chosen to be consistent with previous PPE and methodological assessments of Smerdon et al. (2016) and Li et al. (2016). This consistency allows meaningful comparisons to previous results that were either 90 based on more traditional skill assessment metrics or did not fully diagnose the underlying reasons for skill differences using networks still lose significant numbers of records back in time.
The application of PPEs also requires that the time series subsampled from last-millennium simulations are perturbed with noise to mimic the imperfect connection between measurements in proxy indicators and the climatic signal for which they are interpreted. The common approach within PPEs is to add randomly generated noise series to the subsampled modeled time series representing proxy data, with noise amplitudes scaled to mimic the signal-to-noise ratios (SNRs) that are characteristic of 130 real-world proxies. In this study, we use the CFRs from Smerdon et al. (2016) that were derived from pseudoproxies perturbed with Gaussian white-noise at an SNR of 0.5, a value deemed to be within to the range of SNRs (0.5-0.25) in real-world proxy networks (Smerdon, 2012;Wang et al., 2014). In addition to SN R = 0.5, we also analyze a no-noise experiment (SN R = ∞).
In all model cases, the same realization of 283 Gaussian white-noise series are used to perturb the pseudoproxy network.

145
We analyze four CFR methods that have been widely applied in the CFR literature and specifically discussed in the context of the analyzed PPEs in Smerdon et al. (2016). These methods include two versions of regularized expectation maximization (RegEM) (Schneider, 2001;Mann et al., 2007), standard ridge regressions (Hoerl and Kennard, 1970) and canonical correlation analysis (CCA) (Smerdon et al., 2010b), all of which are briefly described in following subsections. All CFR methods use a calibration from 1850-1995 C.E. and a reconstruction interval from 850-1849 C.E. Temperature and proxy data are available 150 after 1995, but the proxy network as used in Mann et al. (2009a) becomes sparse after 1995 because many records collected over the last several decades obviously do not include measurements after their date of collection. Hence, as in Mann et al. (2009a), Smerdon et al. (2016), andLi et al. (2016), our calibration period is chosen to be 1850-1995 C.E., which follows the convention of previous PPE frameworks. We also note that while we test these specific configurations and methods, the skill assessments that we employ and the methodological insights that are developed are not exclusive to the four methods that are 155 tested because of broad commonalities across the CFR problem.

Regularized Expectation Maximization
The two employed versions of RegEM both use truncated total least squares for regularization (Schneider, 2001;Mann et al., 2007). The first is a standard version of RegEM truncated total least squares as originally described by Schneider (2001), here-inafter TTLS, and the second is the hybrid version applied by Mann et al. (2009a), hereinafter TTLH. The hybrid convention calibrates the multiproxy network on the target temperature field in split spectral domains by first separating the target field and the multiproxy (or pseudoproxy) network into high and low-frequency components. We follow the Mann et al. (2009a) convention by splitting these two domains at the 20-year period using a ten-point butterworth filter. The hybrid reconstruction is then derived by calibrating the pseudoproxy network in the two frequency domains using the RegEM algorithm and subsequently combining the reconstructions from each domain to derive a complete field (see Mann et al., 2005Mann et al., , 2007 for further description 165 of the hybrid method]. Note also that differences between reconstructions derived from the hybrid and standard versions of the RegEM method have been reported to be minimal Mann et al., 2005Mann et al., , 2007Smerdon et al., 2011), although the importance of hybrid calibrations on the skill of the derived reconstructions has been debated (Rutherford et al., 2010;Christiansen et al., 2010). A linear fit to the log-eigenvalue spectrum is used to determine the truncation parameter for the RegEM CFRs in the same manner that was advocated by Mann et al. (2007) for the high-frequency component of their 170 derived hybrid reconstructions. For the Mann et al. (2009a) CFRs, a linear fit to the log-eigenvalue spectrum was again used to determine the truncation parameter for the high-frequency component of the reconstructions, while the low-frequency truncation was determined by selecting the eigenvalue rank yielding 33% of the cumulative variance in the low-frequency field.
This percentage of retained cumulative variance is reduced from 50%, as originally adopted by Mann et al. (2007); the value of 33% has since been advanced by Rutherford et al. (2010) and Mann et al. (2009a) as more appropriate.

Ridge regression
We apply standard ridge regressions Hoerl and Kennard (1970) for the ridge regression CFRs in this study. The application of a single ridge regression was used by Smerdon et al., 2011Smerdon et al., , 2016 , but is otherwise a break from earlier studies that have used ridge regression as the form of regularization in the iterative RegEM algorithm. The application of ridge regression within the RegEM algorithm for the purpose of CE CFRs has been discussed in detail in various publications (Schneider, 2001;Mann 180 et al., 2005;Smerdon and Kaplan, 2007;Lee et al., 2008;Smerdon et al., 2008aSmerdon et al., , 2010aChristiansen et al., 2009). We use standard ridge regression instead of ridge-based RegEM herein, because the iterative RegEM ridge regression converges to the single ridge regression result in the special case of our PPE design, namely when missing values comprise a single and regular block in the data matrix. We determine the value of the ridge parameter for the single ridge regressions in the same manner applied by Schneider (2001) using ridge-based RegEM by minimization of the generalized cross validation function (Golub 3 Skill Assessment

A brief review of the functional methods
The methods of comparing two spatiotemporal random fields developed in Zhang and Shao (2015) and Li et al. (2016) are based on a functional data analysis approach. The basic idea is to perform the comparison in subspaces that are of much lower 195 dimension but preserve a large portion of the variability. Moreover, these comparisons can be done on individual EOF-PC pairs, allowing CFR assessments to be done on specific leading modes of the targeted and reconstructed fields. We briefly review the theoretical framework for the skill assessments below.
Let {X t (s)} N t=1 and {Y t (s)} N t=1 be two spatiotemporal random fields observed over spatial locations, s ∈ D, and time points, t = 1, . . . , N . We define the mean and covariance function of each spatial process as: µ X (s) = E{X t (s)} and µ Y (s) =

200
E{Y t (s)}, the mean functions over s ∈ D, and C X (s, s ′ ) = cov{X t (s), X t (s ′ )} and C Y (s, s ′ ) = cov{Y t (s), Y t (s ′ )}, the covariance functions of X t (s) and Y t (s) over s and s ′ ∈ D, respectively. To compare the mean and covariance functions of two spatiotemporal random fields, we consider the following two hypotheses: The two test statistics for these two hypotheses are T S1 and T S2, which are explained in detail in the following two subsections. Because the empirical distributions of T S1 and T S2 have been derived under H 0 , their p-values can be calculated.
The p-values for these two hypotheses are ultimately what are used to evaluate the comparison between two fields, in this case between the known model field and a CFR.

Mean comparison 210
The mean surface of a given climate field is a measure of its spatial variability across the global domain. In statistics, this is called the first moment of a spatiotemporal process and usually carries very important information about the distribution of the random process. Comparisons between the mean structures between two climate fields is therefore fundamental for assessing their relative characteristics. The mean structure will be compared in subspaces that contain the major variability of the climate field, so we start by defining the subspaces and projected mean differences prior to defining the test statistics (T S1). We denote 215 the ith eigenvalues and eigenfunctions, also called empirical orthogonal functions (EOFs), corresponding toĈ X by {λ i X } and {φ i X }, whereĈ X denotes the sample covariance function using all time points. Then we define a sequence of vectors consisting of the projected mean differences on the first L eigenfunctions: for 1 ≤ k ≤ N , where < x, y >= x T y, andμ X,k (μ Y,k ) denotes the sample mean based on the recursive subsamples {X t (s)} k t=1 220 ({Y t (s)} k t=1 ). Our test statistic for hypothesis (i) is therefore The parameter L is user chosen and determines how many eigenfunctions are to be used in the test.

225
The covariance structure refers to the correlation of climate observations over different locations. It is called second moment in statistics. When the climate field can be approximated by a Gaussian random field, the first and second moments determine the distribution of the entire random field. The covariance structure refers to either the local correlation or far-field correlation driven by so-called teleconnections within climate fields, and thus is an important description of the large-scale physical dynamics that underlie the climate system. To allow comparisons between leading patterns in modeled or reconstructed fields, 230 we modify the test for covariance to make it suitable for comparing two cross-covariance functions. We again define subspaces and projected differences of a covariance structure. Let C 1,2 X (s, s ′ ) (C 1,2 Y (s, s ′ )) be the cross-covariance function for s ∈ D 1 and s ′ ∈ D 2 and letĈ 1,2 X (s, s ′ ) andĈ 1,2 Y (s, s ′ ) denote the sample cross-covariance function for X t (s) and Y t (s) based on all time points. We perform a Singular Value Decomposition (SVD) onĈ 1,2 X (s, s ′ ) orĈ 1,2 Y (s, s ′ ), say onĈ 1,2 X (s, s ′ ):

235
where U and V are orthogonal matrices with columns being u 1 , ..., u n and v 1 , ..., v m for n and m grid cells in subregion D 1 and D 2 , respectively. LetĈ 1,2 X,k (Ĉ 1,2 Y,k ) denote the sample cross-covariance based on recursive subsamples {X t (s)} k t=1 ({Y t (s)} k t=1 ). That isĈ 1,2 X,k is the sample cross-covariance of {X t (s 1 )} k t=1 and {X t (s 2 )} k t=1 . We define a sequence of matrices by the projected covariance differences, Letα k be the vectorizedĈ k . The test statistic for hypothesis (ii) is where d is the length of the unique component inα k , which contains the elements on and below the main diagonal of C k . That Again L is chosen by the user and can be determined by the cumulative percentage of total variation. 245 Additionally, our notation uses X to represent the synthetic climate from climate models and Y to represent the CFRs based on the X process. The test statistics of above two tests will change if we calculate the sample covariance matrix based on Y process rather than the X process, because the EOFs from Y are different than those from X. Thus, they are not exchanagable but we have fixed the sample covariance matrix based on the X process because the goal of our application is to evaluate the skill of CFRs by comparing them to their known targets, the climate model output on which each PPE is based.

Mean-structure skill
Despite the formalism of the preceding section, the important implication is that comparisons between modeled and reconstructed fields can be measured in terms of p-values based on a null hypothesis that similarities are within the range of comparisons between two random spatiotemporal fields. In other words, a p-value close to 1 reflects similarities between 255 modeled and reconstructed fields that are statistically significant against this null hypothesis, while p-values close to 0 reflect differences that could be explained by random chance. Moreover, these comparisons are broken out among the leading spatiotemporal patterns in each field. While such comparisons can be done for any number of leading principal components, we focus herein on the leading five in each field. Subsequent comparisons are made between the CFRs in each model-based PPE and the known model field during the reconstruction interval (850-1849 CE).

260
The mean-structure performance, in terms of the developed skill metric for the five leading EOFs, is shown for each CFR method within each of the model-based PPEs in Figure 2; results are shown for PPEs using pseudoproxies with SN Rs = 0.5 PPEs, while the BCC and IPSL models appear to present the most challenging tests for the CFR methods. Secondly, the level of noise in the pseudoproxies has an important and expected impact on the nature of the methodological performance. Particularly for the CCSM, GISS and MPI based PPEs, the no-noise experiments yield much higher skill scores than the SN R = 0.5 experiment. Notably, however, even the no-noise PPEs yield CFRs with variable skill that depends on method and model.

270
With regard to the performance of specific methods, TTLS and TTLH are generally most skillful across the top five EOFs in the CCSM, GISS, and MPI PPEs, although that is not true across all of the EOFs and is more ambiguous for the CCSM experiment with SN R = 0.5. It is also true that the TTLS and TTLH methods perform similarly within each model-based PPE across the top five EOFs, which is not surprising given the close methodological lineage of the two methods . Similarly, the CCA and RIDGE methods have similar skill performance for each of the five EOFs across the PPEs, 275 although the CCSM experiment shows some ambiguity with regard to these general observations again in the SN R = 0.5 case. Finally, the methods collectively perform the worst within the BCC and IPSL PPEs, a finding that is again consistent with the mean bias assessment in Smerdon et al. (2016) who found the largest mean biases in the BCC and IPSL PPEs.
In addition to the above general observations, the applied skill metric allows the skill associated with each of the leading EOFs to be separated. Nothing similar to these separations were performed in Smerdon et al. (2016) and they indicate a 280 complicated structure associated with skill across each of the model-based PPEs and tied to the applied method. For instance, some methods perform very well on several leading EOFs, while performing very poorly on several others (CCA in the CCSM PPE or CCA and RIDGE in GISS PPE). Other methods perform poorly on the leading EOF, while performing very well on the remaining EOFs (TTLS and TTLH in the CCSM and MPI PPEs) or perform poorly on all EOFs except the 5th EOF (CCA, the methods across the different model-based PPEs, but the reasons for this performance is not immediately obvious from these assessments. We therefore perform a similar analysis in the next subsection for the covariance-structure skill, before working to more deeply understand the performance of the CFR methods as indicated by the applied skill metric.

Covariance-structure skill
Similar to the mean-structure comparison, we employ the applied skill metric to evaluate how derived CFRs reproduce the 290 known covariance of the climate model simulations. We first note, however, that the covariance comparisons between the CFRs and the known climate model fields over the entire reconstruction domain yielded results that were universally unskillful. In other words, our analyses yielded p-values equal or close to 0 for all methods at all five leading EOFs and across all PPEs. This result is perhaps not unexpected given an established understanding that there are large regions with very low skill throughout the global CFR domain. Smerdon et al. (2016) demonstrated that many regions of the reconstructed fields have small and 295 insignificant correlations relative to the known model fields, while locations among the tropics and over dense pseudoproxy sampling locations achieve much larger correlations. These collective results thus suggest that if the global domain is used to identify EOFs many of the locations will be defined by variability dominated by noise. Alternatively, constrained domains that encompass dominant regions of variability can be used to target leading EOFs that are less susceptible to noise. We therefore modify our approach in this section to describe comparisons between areas of dominant teleconnections in the model fields.

300
Our modified approach is to analyze the covariance structure only in regions where the teleconnection associated with the El Niño-Southern Oscillation (ENSO) is dominant. We specifically focus on ENSO because it is the leading mode of internal variability on a global scale, making it easy to identify and likely strongly expressed in the leading few modes of each climate model simulation. We examine the ENSO dependencies by computing the correlation between the time series of averaged temperatures over the Niño3 region (5 • N-5 • S, 150 • W-90 • W), and the time series at all other grid points in the 305 global temperature field. Maps of these correlations for each climate model are shown in Figure 3. We discard locations that are proximal to the ENSO region (local covariance structure) that are not the consequence of the ENSO teleconnection (large-scale covariance structure). An empirical covariance estimate suggests that pairs within 10,000 km are due to this proximity. Thus, we exclude the locations in the orange shaded area (20 • S-20 • N, 150 • E-35 • W) that are within 10,000 km from the center of the Niño3 region, as shown in Figure 3. After excluding these proximal locations, we choose the grid points (black dots in 310 Figure 3) that have significant positive or negative correlations with the Niño3 index in each model at the 10% significance level, which we interpret as reflecting each model's ENSO teleconnection pattern. Because the selected grid points vary for different climate models, we use the collection of overlapping grid points (black dots) from all five climate models. The p-values for the modified covariance-structure skill metric assess how well the large-scale teleconnection patterns associated with ENSO are reproduced in the CFRs, relative to the known model fields. These results are shown in Figure 4, which 315 presents the p-values for all four CFRs for the leading five PCs using the SN R = 0.5 and ∞ PPEs. Relative to assessments over the entire domain, stronger associations between the CFRs and known model fields are observed when the covariance structure is limited to the ENSO teleconnection regions. Even with a constrained focus on the ENSO teleconnection regions, however, the covariance-structure skill is still limited across most of the methods and model-based PPEs. The TTLS and TTLH methods are again the most skillful across all of the methods. In the case of no-noise (SN R = ∞), skill is detected for the 320 CCA method across all model-based PPEs and there is some skill for the RIDGE method except for the IPSL and MPI PPEs.
Interestingly, the skill of TTLS and TTLH is higher for most of the EOF patterns in the SN R = 0.5 for CCSM, GISS, and IPSL PPEs, relative to the no-noise case, while it is more typical to have skill reduction for SN R = 0.5 as in the CFRs derived for the BCC and MPI PPEs. Specifically, compared to the no noise case, CCA and RIDGE results show skill reduction for SN R = 0.5 for all PPEs. We also note that for the covariance comparison, it is particularly important to examine the skill 325 of CFRs at the first PC because the first EOF contains over 80% of the total variation in all models (see further discussion in  To complement the analysis of the covariance-structure skill in the ENSO teleconnected regions, we investigate the proportion of variance explained by the first five leading EOFs of the ENSO teleconnection dominant region (D 2 ). Figure 5 shows that more than 30% of the variance is explained by the first EOF in CCSM and IPSL models and all the other three models  Figure 3 shows that MPI exhibits a strong teleconnection signal. This is particularly so as noise is added to the pseudoproxies and especially for the CCA and RIDGE methods. Thus for the MPI model, the skill of most of the CFRs is 340 associated with the first PC when there is no noise (SN R = ∞) but none when the noise is present (SN R = 0.5).   Figures 2 and 4, the skill as a function of cumulative variance reveals that most methods across most PPEs do not recover mean-structure skill beyond about 30% of the total variability. Regarding mean-structure skill specifically, TTLH exhibits the most skill within the CCSM and GISS PPEs, and only up to about 20% -30% of the total variation. In the GISS    Figure 6 -7 indicate that even though CFRs show some skill for each individual PC, the cumulative variability that is skillfully explained must be evaluated for a more complete picture of methodological performance. This is especially true for the TTLS and TTLH methods. Despite showing outstanding skill recovering the modeled mean structure at each PC in most of the climate models in Figure 2, Figure 6 shows that the two methods often do not account for skill up to higher order cumulative 365 EOFs. For example, both TTLS or TTLH are skillful only up to 20-30% of the total variation of the climate field. Additionally, we note that the CCA method poorly recovers the mean structure in most of the climate models and both CCA and RIDGE are poor in recovering the covariance structure in all five climate models when the comparison is projected onto the cumulative EOF basis function. characteristics of the climate simulated by each model. In the following subsection, we therefore characterize the features of the temperature fields simulated by the models and the underlying consequences for the various CFR methods. We interpret the skill assessments by exploring several features of the CFRs and the underlying model fields on which the PPEs are based: 375 (i) the percent variance explained by the leading EOFs in the modeled temperature field, (ii) the temporal stability of the EOF structure in the reconstruction and calibration periods, and (iii) the degree to which the spatiotemporal variability in the modeled temperature fields are represented by the locations where pseudoproxies are sampled.

Structure of the Eigenvalue Spectrum
Because each of the CFR methods investigated in this study are forms of regularized multivariate regression, they all share 380 a similar feature, namely they each only target a few of the leading EOFs in the target temperature field. An important control on the skill of CFRs is therefore tied to how much of the variance in the target temperature field is explained by the leading EOFs. We therefore hypothesize that the PPEs based on the climate model simulations with significant amounts of variance in the leading few EOFs will be those experiments in which the CFRs perform most skillfully.
In Figure 8  In addition to providing a broad assessment of the relative challenges presented by the individual model-based PPEs, the eigenvalue spectra for each of the CFRs in each of the model experiments also indicate that the similarity between the variance explained in the first several EOFs of the target and reconstructed fields is largely indicative of the performance of the individual CFR methods. In particular, the proportions of the first eigenvalues in the TTLS and TTLH CFRs are almost equivalent to those 395 of the true model fields from the CCSM, GISS, IPSL, and MPI simulations (Figure 9). This is reflective of the fact that those two methods generally performed the most skillfully in both of the skill assessment metrics. In contrast, the proportions of the first eigenvalues of the CFRs in the BCC PPE are significantly lower than that of the true model, which matches with the relatively poor skill of all CFR methods based on the BCC PPE. CFR performance is therefore strongly associated with how well the first EOF represents the total variation in the targeted climate field and how well that variance is reproduced in a given 400 CFR. While the above analyses of the eigenvalue spectra give important insights into the difficulty of reconstructing a given climate field and the likely performance of a reconstruction that targets such a field, the variance explained by a given set of EOF-PC pairs alone may not be fully indicative of reconstruction performance. For instance, it is possible that the EOFs in the reconstruction are reordered so that they do not represent well the spatial characteristics of any given EOF in the target field. It 405 is therefore useful to assess how well the spatial characteristics of specific EOFs in a CFR represent the spatial characteristics of the EOFs in a targeted field.
To assess this feature and allow for the fact that a given reconstructed EOF may be ordered differently than the equivalent EOF in the target field, we take the inner product between each of the first three EOFs in the reconstructed and targeted fields (this is similar to the spatial correlation statistic often discussed in the climate literature, e.g. Baek et al., 2017). If the absolute 410 value of the inner product is close to 0, it suggests that the spatial patterns represented by two EOFs are very different, while if the inner product is close to 1, it implies that they are equivalent. The p-values testing the significance of the inner products can be derived using bootstrap analysis. Each sample is obtained by bootstrapping spatial locations at each time point, from which the inner product of the CFR and the associated true climate model is calculated based on the sampling. For each inner product pair, we perform bootstrapping 1,000 times and calculate the p-value of the observed inner product. Table 1 presents the inner products of the first 3 EOFs for the CFRs and the corresponding climate model fields, with significance also indicated. Inner products along the diagonals that are close to one and marked as significant indicate that the order of their corresponding EOFs together with their spatial representations in the CFRs are similar to those of the targeted climate field. Values close to one, significant, and off the diagonal would indicate a potential reordering of the reconstructed EOFs, relative to the EOF structure of the target field. Collectively, the inner products indicate that in addition to reflecting 420 similar eigenvalue spectra (Figure 9), CFRs targeting the CCSM and MPI models also produce EOFs that are similarly ordered with patterns that well represent the true model EOF patterns. This is represented by high and significant inner product values along the diagonals; the opposite is true of most experiments with the BCC model. In summary, in order for the CFRs to well 20 https://doi.org/10.5194/cp-2020-153 Preprint. Discussion started: 23 February 2021 c Author(s) 2021. CC BY 4.0 License. depict the mean and covariance structure of the true climate model, the first few leading EOFs should carry the majority of the total variation while also capturing the spatial features of the targeted EOFs as shown in Table 1

Temporal stability of the leading EOFs
An important underlying assumption of linear regression based CFR methods is that the identified patterns in the calibration period remain temporally stable back in time over the period of reconstruction. In particular, temporal stability refers to how much the leading patterns of modeled data in the reconstruction period and in the calibration period share in common and to what extent the order of leading patterns in the calibration period is preserved in the reconstruction period. If these patterns 430 are not temporally stable, a key assumption of the reconstruction approach is violated and the skill of the reconstruction will be affected. Differences in the performance of CFR methods, such as the differences in the mean structures assessed in Figure   2, may therefore be explained by differences in the temporal stability of leading EOFs in the calibration and reconstruction intervals within each of the model simulations that form the basis of the PPEs. In other words, if the EOF character and structure within the reconstruction interval is similar to that of the calibration interval within a model simulation, all of the CFRs based on this model simulation are expected to capture the mean structure better than the CFRs based on simulations for which this is not the case. Despite the above significance for reconstruction methods, it is unknown whether teleconnections in the real climate system remain stable over centennial to millennial time scales or how widely they have varied if they are not stable (e.g. Coats et al. 2013).
To test the stability of the teleconnections in the model simulations, we again use the inner product as a measure of the 440 similarity between spatial patterns, in this case between the EOFs in the calibration and reconstruction periods. These inner products are listed in Table 2 and the p-values of the inner products are again computed through bootstrapping; the pairs of EOFs that are significantly aligned are marked. If the inner product matrix in Table 2 contains the highest values along the diagonal and those values are significant in the bootstrapping experiments, it suggests that the order and character of the EOFs are similar in the calibration and reconstruction intervals. This is predominantly the case for the CCSM and MPI simulations,

445
implying that the reconstruction period is defined by the same dominant pattern of leading EOFs in the calibration period.
Moreover, for those two climate models, the order of the modes are preserved as well. In contrast, the BCC model reveals very weak associations between the calibration and the reconstruction periods, and IPSL only displays strong association for the first EOF.
The temporal stability assessment, when joined by the previous assessment of the eigenvalue spectra allow a more specific

Sampling locations
The sampling locations of proxies also play a key role in the performance of CFRs, because all CFR methods train their statistical models based on how the entire climate field relates to the climate variability reflected in proxy locations. If the climate variability at sampling locations poorly represents the variability of the entire climate field, then it will be very challenging for CFRs to reproduce the mean or covariance structure of the targeted climate. To investigate this possible issue, we sample 465 the climate from only the proxy sampling locations and then study the capacity of the climate at those locations to recover the climate globally. This is carried out by directly using the EOFs at the sampling locations to estimate the climate at other locations and examine the mean squared error (MSE) of the estimates.
In order to account for spatial correlation in this context, we first decorrelate the spatial climate simulation before fitting a linear model and then add the correlation back after we obtain the estimates. More specifically, let X * m (s, t) denote the spatially 470 decorrelated climate simulation obtained by where m is the index for a given model simulation (e.g. BCC, CCSM, ..., MPI) and Σ m is an estimated spatial covariance matrix of X m (s, t) with t = 850, ..., 1849 using an exponential covariance function.
There are 283 sampling locations out of 1732 grid points. Let f j be the j-th PC of the sampling network from year 850 to 475 1849. We construct a linear model of the climate at all 1,732 locations and on f j with the decorrelated spatial fields X * m (s, t): where ǫ(s, t) are white noise because X * m (s, t) has been decorrelated. So for each fixed location s, we have 1,000 observations (t = 850 − 1849) and 1,732 different regressions will be modeled on the whole domain D. We set the number of EOFs to be K = 10 which typically preserves about 85% of variability in the sample climate field. After we obtain X * m (s, t) then we derive . To evaluate the model fitness, we calculate the mean squared error (MSE) as follows : The M SE m measures how well the sampling network represents the variability in the climate model simulation. The basic idea is to measure how much climate variability can be recovered based on the sampled climate alone. We by no means argue that our method is optimal for this purpose, but this MSE estimate provides a reasonable measure for the capacity of climate 485 sampled at the pseudoproxy locations to represent the simulated global climate in each model. The MSE in each of these simulations is relatively high throughout the global field, with CCSM and IPSL displaying extremely 495 high MSE in the northern extratropics and polar region and IPSL also yielding high MSE in parts of the southern extratropics and polar regions.
The sampling network in BCC well represents the temperature variability around the equator, however, it yields very high MSE in the NH extratropics. This makes the distribution of MSE associated with BCC largely skewed to the right due to the extremely large MSEs in the NH extratropics (Table 3 and Figure 10). To further aid our interpretation, we provide the maps 500 of the first and second EOFs in Figure 11. A joint comparison between Figures 10 and 11 shows that the variability of the first and second EOFs of BCC mainly concentrates in the Northern Hemisphere where the large MSE is observed. The implication is that the pseudoproxy sampling network in BCC does not well sample variability in the NH extratropics, while the leading EOFs in BCC best represent variability over that region. This collectively further explains the poor performance of CFRs with BCC in Figure 2 (SNR=0.5).
505 Figure 10 also indicates that the MSE is high in CCSM and even higher for the IPSL model. Figure 11 nevertheless indicates that the main difference between CCSM and IPSL is that the CCSM simulation shows strong signal throughout the leading EOFs whereas the IPSL model only shows distinct signal in its first EOF. This helps explain the skill of CFRs associated with the IPSL PPE concentrating in its first EOF (Figure 2). On the other hand, the GISS and MPI models exhibit the smallest mean MSE, thus supporting the outstanding skill of their CFRs in reconstructing the mean. However, the performance of CFR 510 methods, especially CCA and RIDGE, seems also additionally vulnerable to the skewness of the MSE, implied by Figure 2 (SNR=0.5). If we compare the CFR performance associated with the GISS and MPI PPEs, both TTLS and TTLH perform well but CCA and RIDGE perform better in the MPI PPE due to the relatively high skewness of MSE in the GISS model.
In summary, both the skewness of the MSE and the high MSE distribution with weak signal on the leading EOF structure together affect the skill of CFRs in all climate models. This is because large differences between the global climate and what 515 can be sampled from the proxy network likely weakens the skill of CFRs in retaining the major mean structure of the targeted climate. In contrast, however, even if the mean MSE is high due to high variability of the temperature field, the mean structure may be well reconstructed by the CFRs if the leading EOF shows distinct signal.

Discussion and Conclusions
We have provided a comprehensive assessment of four widely-applied CFR methods in terms of their skill recovering the mean 520 surface and covariance patterns in the targeted temperature field. The assessment is conducted in the context of PPEs based on five climate model simulations spanning the 850-1995 CE interval. We have first applied the evaluation metrics presented in Li et al. (2016) and Zhang and Shao (2015) to assess the skill of each CFR with respect to the differently modeled climates.
We then focused on interpreting and understanding the variability in the skill. We find that although part of the skill variability arises from the reconstruction method itself, a large part of the discrepancy in the skill across different methods is attributable 525 to different characteristics of simulated temperature fields associated with different climate models. Our discoveries provide useful insights into the assessment and improvement of CFR methods, while the focus on the underlying characteristics of the targeted climate field make our findings relevant beyond the four methods that we have tested.
The underlying features of a targeted temperature field that can affect the performance of CFRs, as represented across the climate model simulations that we have investigated, include: (i) the characteristics of the eigenvalue spectrum, namely the Our results show that the CFRs derived within the CCSM, GISS, and MPI PPEs are skillful at recovering the mean structure, whereas the CFRs associated with the BCC and IPSL PPEs exhibit large biases that are consistent with those presented in 535 Smerdon et al. (2016). These results are likely due in part to the fact that the EOFs of the CCSM and MPI models are stable across the calibration and reconstruction periods. Additionally, the sampling network well represents the global temperature of GISS and MPI whereas is inadequate for the BCC model. This plays a key role in weakening the ability of CCA and RIDGE to reconstruct the mean of the BCC model. In terms of skill recovering the spatial covariance associated with teleconnections, the TTLS and TTLH methods outperform CCA and RIDGE, and in general CFRs derived in the CCSM PPEs outperform the CFRs An important finding is that the skill of CFRs is highly associated to how well the leading EOFs in CFRs represent the targeted climate field concerning both the variability and the subspace. We find that the spectra of eigenvalues in the CCSM, GISS, and MPI models align well with their own CFRs. Among the four CFRs, the TTLS and TTLH methods better recover the eigenvalue spectrum of the targeted climate by having a large amount of variability carried by leading EOFs. In particular,

550
CCSM exhibits the highest variability on its first few leading EOFs and this pattern is well reproduced by the corresponding EOFs in the CFRs derived from the TTLS and TTLH methods. Critically, these characteristics could be assessed for real-world data sets or through comparisons between CFRs and the observational data during the calibration and validation intervals. Such assessments are therefore strongly encouraged as additional means of testing both the likelihood of skillful reconstructions, as well as adding to a source of calibration and validation interval skill metrics.

555
Overall, the skill assessment we have performed using PPEs based on five climate models allow a deeper understanding of both the reconstruction methods and the characteristics of the synthetic climate fields. As we have shown, CFR assessments can vary based on the underlying spatiotemporal characteristics of the modeled target field. The ultimate goal is to evaluate and improve real-world CFRs. Based on the results of this study, the reconstruction performance can depend on the eigenvalue spectrum, the temporal stability of covariance patterns across the reconstruction and calibration intervals, the ability of sampling 560 locations to represent the global climate characteristics, and the strength of the dominant teleconnections in the targeted climate field. A careful investigation of the characteristics of the real-world climate will help identify the likely impact of these features in CFRs derived from real proxies, as well as choose optimal reconstruction methods and proxy networks given the identified characteristics of targeted climate fields. Although the characteristics of the real climate of course cannot be modified, our findings will also help to define absolute limits on the skill of CFRs and thus improve their interpretations.