Differences between paleoclimatic reconstructions are caused by two factors: the method and the input data. While many studies compare methods, we will focus in this study on the consequences of the input data choice in a state-of-the-art Kalman-filter paleoclimate data assimilation approach. We evaluate reconstruction quality in the 20th century based on three collections of tree-ring records: (1) 54 of the best temperature-sensitive tree-ring chronologies chosen by experts; (2) 415 temperature-sensitive tree-ring records chosen less strictly by regional working groups and statistical screening; (3) 2287 tree-ring series that are not screened for climate sensitivity. The three data sets cover the range from small sample size, small spatial coverage and strict screening for temperature sensitivity to large sample size and spatial coverage but no screening. Additionally, we explore a combination of these data sets plus screening methods to improve the reconstruction quality.

A large, unscreened collection generally leads to a poor
reconstruction skill. A small expert selection of extratropical
Northern Hemisphere records allows for a skillful high-latitude
temperature reconstruction but cannot be expected to provide
information for other regions and other variables. We achieve the
best reconstruction skill across all variables and regions by
combining all available input data but rejecting records with
insignificant climatic information (

In the past 20 years, a lot of effort has been invested in improving climate reconstructions for the last centuries to millennia based on indirect climate information – so-called “proxies”. Focus has been on both large-scale averages as well as the reconstructions of regional to global fields (Masson-Delmotte et al., 2013; Smerdon and Pollack, 2016). Temporal and spatial resolution varied with the included paleoclimatic archives. However, most reconstructions for the past centuries rely heavily on the most abundant indirect climate archive, tree rings, and specifically on tree-ring width (TRW) and late-wood density (MXD). Differences between reconstructions have mostly been discussed with differences in reconstruction methodology in mind (Christiansen and Ljungqvist, 2017). However, a new study shows good agreement between a wide range of methods if reconstructions are based on the same input data set (Neukom et al., 2019a, b). Another recent study found that temperature-sensitive tree-ring proxies from the PAGES2k database (Emile-Geay et al., 2017) lack multi-centennial trends, which are found in other proxy archives (Klippel et al., 2019). This suggests that the input data play a crucial role for differences between reconstructions. This fact is also seen in data assimilation for weather prediction, e.g., at the addition of satellite to radiosonde observations (Swinbank et al., 2012, p. 365). Today, many proxy data archives are available; hence compiling input data for reconstruction is not only a matter of the amount of proxy data, but also of their selection, i.e., screening.

In this study, we therefore aim at evaluating the effect of various tree-ring data collections and their screening on the final reconstructions. Tree-ring proxies are by far the most numerous climate information source for the past centuries and additionally chosen because our methodology relies on annual data without dating uncertainties. Due to the relevance of temperature in the climate change discussion and the fact that many biological proxies react to temperature stress, temperature has so far been the variable of most interest. However, to study the underlying processes a multi-variable perspective is required. Therefore, we evaluate the effects of the input data choice, using a state-of-the-art data assimilation technique, which allows for multi-variable climate reconstructions in the form of model simulations that are in optimal agreement with proxy information (Bhend et al., 2012; Franke et al., 2017a).

A number of previous studies based on data assimilation techniques tended to assimilate a high quantity of input data instead of applying strict data selection beforehand (e.g., Steiger et al., 2018; Tardif et al., 2019). The idea is that regression-based proxy system models weight each proxy series by their regression residuals. Hence, proxies that do not contribute information will be downweighted automatically. However, this weighting may not work perfectly because of two factors: (1) the regression depends on overlapping paleodata and instrumental measurements, which often results in a small sample (Fig. 1 in Jones et al., 2012), uncertain residuals and possible model overfitting; (2) moisture- and temperature-sensitive proxies may correlate and hence moisture-sensitive paleodata will be used to correct temperature and vice versa. However, these two variables probably have very different multidecadal to centennial variability (Franke et al., 2013). The growth-limiting factor may even change over time (Babst et al., 2019).

Proxy locations of the three collections.

In this study, we use the Kalman-filter-based state-of-the-art data assimilation technique introduced in Bhend et al. (2012), which is very similar to the methodology used in the last millennium reanalysis (LMR) project (Hakim et al., 2016; Tardif et al., 2019). In contrast to LMR, our method is a transient-offline method, in which the background state is time-dependent due to the external forcing prescribed for the climate model simulations. In our experiments, we focus on the effect of the input data choice on the final reconstruction. We compare three published collections of tree-ring records (focusing on TRW and MXD), of which at least two are commonly used for climate reconstructions. These have very different characteristics: (1) the B14 collection of 2287 consistently detrended TRW chronologies from the International Tree Ring Data Base (ITRDB), not screened for climate sensitivity (Breitenmoser et al., 2014); (2) TRW and MXD series from the PAGES2K database version 2 (Emile-Geay et al., 2017), with a selection of 415 temperature-sensitive records, most selected by a statistical screening for positive correlation with instrumental temperature; and (3) the N-TREND tree-ring collection of 54 TRW, MXD or blended TRW-MXD time series (Wilson et al., 2016), selected by experts to be the best temperature recorders. Thus, the three data sets cover the range from large sample size and spatial coverage but no screening for temperature sensitivity to small sample size and small spatial coverage but strict screening. Note, that these collections were generated with slightly different aims, which affects their use in reconstructions. Thus, for instance, we cannot expect to achieve the best global-scale field reconstruction from a proxy collection covering a much smaller area (Kutzbach and Guetter, 1980). However, all data sets are used for climate reconstruction.

In the next section the method and data sets are introduced in greater detail before we show our results. Then we discuss the possible reasons for our results and the differences compared to previous studies. Finally, we draw our conclusion on what an optimal proxy selection process should look like.

We use three input data sets for comparison; all consist of
annually resolved tree-ring measurements, which have hardly any
dating uncertainties.

B14 is a collection by Breitenmoser et al. (2014) of 2287 uniformly detrended and standardized TRW measurements from the ITRDB (Zhao et al., 2018). We use the full collection without any further screening for climate or temperature sensitivity. Hence, this represents the data set with the highest quantity of records. However, the weighting of temperature information in the paleodata is completely down to the reconstruction method.

PAGES2k is a collection of 415 TRW and MXD series from PAGES2k
data base version 2 (Emile-Geay et al., 2017). These are all
records that correlate significantly (

N-TREND is a collection of 54 tree-ring chronologies based on TRW, MXD or a combination of both. They were chosen by experts with the purpose being to provide the best tree-ring paleodata for temperature reconstructions (Wilson et al., 2016). Hence, they are our low-quantity, highest-quality input data set with the least spatial coverage.

EKF is the offline variant of the ensemble square root filter
(Whitaker and Hamill, 2002), in which the observations (

A recent comparison of Valler et al. (2019) has shown superior performance when using an improved covariance estimation, which blends 50 % of the 30-member time-dependent covariance with 50 % of a 250-member “climatological” time-independent covariance (experiment 50c_PbL_Pc2L in Valler et al., 2019). In this paper we use both the original setting as in Franke et al. (2017a) as well as the improved setting proposed by Valler et al. (2019).

Our paleo-reanalysis is based on anomalies from a 71-year period around the current year. Low-frequency variability is a function of the models' response to the prescribed external forcings and background conditions, which include sea surface temperatures. Because low-frequency variability is not consistently preserved in paleodata (Franke et al., 2013; Klippel et al., 2019) but reasonably well represented in the model simulations of the last millennium (Franke et al., 2017a), this approach is expected to provide consistent skill at all timescales. Note that the assimilation of anomalies retains possible model biases. This circumvents a big problem in data assimilation approaches with temporally varying input data networks. Observations that gradually pull the model away from its biased state, can lead to artificial trends or step functions in time series.

We use a linear multiple regression PSM to simulate tree-ring
observations using modeled temperature and/or precipitation. The
regression model is calibrated with gridded instrumental data (CRU
TS 3.1, Harris et al., 2014) in the period 1901–1970. It includes
monthly temperature (and precipitation) during the growing season
April to September. In this study, we limit the analysis to the
Northern Hemisphere because the majority of the tree-ring
observations can be found there. In the first four experiments
(Table 1), which only use temperature (

In this study, the period in which the regression coefficients of the PSM are estimated and the regression residuals are calculated overlaps with the period when the reconstruction skill is estimated. This apparent lack of independence is negligible in this case because regression coefficients are estimated from gridded instrumental data sets to translate grid cell temperature (and moisture) anomalies to local tree-ring measurements. The optimization is done on tree rings, not on the climate data, and it is done on many local scales. In that sense the effects of the dependence are rather indirect. In contrast to statistical reconstruction methods, which directly estimate a climate variable such as temperature through the regression parameter estimate, our assimilation method is far less affected by the calibration procedure. Nevertheless, using the same data for validation probably leads to a slight overestimation in reconstruction skill. However, in this study we just compare the relative skill of various inputs data sets, so the impact of dependencies will be the same for all. Concerning the regression residuals, again the error estimate concerns tree-ring width, not climate parameters. We use the residuals as an estimate of error covariance. In case we underestimate the residuals, proxy observation would have too much weight in the assimilation process compared to the simulations. Uncertainty estimation in both observations and models is a crucial but challenging part of data assimilation. We evaluate the spread-to-error ratios to assess the under- or overconfidence of our reconstructions (Franke et al., 2017a).

If multiple data collections are combined, there may be duplicates
of the same proxy, possibly in differently treated or detrended
versions. We conduct experiments where we prevent single sites from
being assimilated twice by only choosing the best proxy (smallest
regression residuals irrespective of series length) in
a

Experiments.

We evaluate the quality of the reconstruction based on correlation
with gridded instrumental observations of temperature, precipitation
(Harris et al., 2014) and sea-level pressure (Allan and Ansell,
2006) in the period 1901–1990 as a reference (

To evaluate the influence of the input data on the final reconstruction, we conducted the set of experiments described in Table 1.

Pearson correlations coefficients between the analysis and gridded
instrumental data in the 20th century. Panels

Temperature correlation improvement of the analysis over the original model simulations, i.e., correlation between analysis and CRU TS minus correlation between simulations and CRU TS, where red colors indicate an improvement of the analysis. All maps show the April to September growing season of the Northern Hemisphere.

Same as Fig. 3 for precipitation correlation, where green colors indicate an improvement of the analysis.

Temperature correlation coefficients between the analyses and gridded instrumental data are positive nearly all around the globe and for all three proxy collections (Fig. 2a–c) because the transient simulations follow forcings and boundary conditions and hence show proper multidecadal variability and a 20th-century warming trend. However, this is not the case for precipitation, which does not show a warming trend (Fig. 2d–f). In contrast to the assimilation of PAGES and NTREND (Fig. 2d, e), we can observe clearly higher correlations in the United States if the B14 proxies are assimilated (Fig. 2f). Although these first three experiments only use a temperature PSM, information can spread to other variables through the covariance matrix.

To evaluate the differences between the experiments due to the data assimilation we focus on correlation improvement over the background (i.e., the model simulations, which already correlate with the reference data set mainly due to the specified sea surface temperatures (SSTs) and external forcing). First, we compare the role of the choice of the three input data sets assuming only temperature dependence and no constraint on the regression model structure (Fig. 3a–c; experiments NTREND_T, PAGES_T and B14_T). The highest local improvements are reached with the NTREND data set; however the largest spatial coverage of improvement is found with the B14 data set. Note that temperature correlation improves with all data sets and decreases nowhere, although some proxy records in the B14 data set do not contain any temperature signal. This has been identified with negative regression coefficients for the majority of B14 tree-ring series in the United States of America. In terms of correlation the data assimilation scheme appears to weight the input data appropriately. Looking at precipitation and sea-level pressure correlation improvements (Figs. 4 and 5a–c), we find hardly any improvements with the NTREND collection. In contrast, the B14 data set leads to some precipitation correlation improvements over North America, where no NTREND series are located. Sea-level pressure correlations improve in some regions such as Europe but decrease in other regions like most of Asia (Fig. 5c).

Same as Fig. 3 for sea-level pressure correlation, where red colors indicate an improvement of the analysis.

The correct sign of the anomaly, measured by correlation, only tells us about one aspect of the reconstruction quality. To see if the amplitude of the anomaly is also reconstructed correctly, we look at the RMSESS skill score (see the “Data and methods” section). Here, we find large differences between the proxy collections (Fig. 6). With NTREND_T we find improvements everywhere, whereas B14_T shows more regions with negative than positive skill (note that we use PSM with only temperature). The PAGES data set has mainly positive skill but negative skill in a large region around the Himalaya and in some parts of North America. This suggest that using moisture-sensitive proxies to reconstruct temperature as in B14_T, which works just because temperature and precipitation are correlated at a given location, is not ideal. Hence, further experiments with an improved PSM and upgraded screening procedure were conducted to take the proxies' temperature or moisture sensitivity better into account and to find an option to use the PAGES and B14 collection at locations where no expert-selected proxies are available but rather keep the quality of the expert-selected data where it is available.

Before we come to a more sophisticated PSM and more sophisticated input data screening, we simply combine all three data sets still using a model with only temperature (ALL_T). This experiment performs well. Temperature correlation now reaches levels of the NTREND_T experiment, where NTREND data are available, and additionally correlation improvements cover the regions where only PAGES or B14 have data (Fig. 3d). RMSESS values are positive in most regions, too. However, around India and the Himalaya negative skill is likely related to the impact of the PAGES data, whereas negative skill in the US southwest seems to be the result of B14 data modeled as temperature only. Precipitation correlations improved only marginally (Fig. 4d) and precipitation RMSESS (Fig. 7d) is mostly negative.

Temperature RMSESS skill score, where red colors indicate an improvement of the analysis.

The obvious change to improve precipitation reconstruction skill is to use a PSM that includes precipitation, i.e., a multiple regression model with 12 coefficients for temperature and precipitation influence during the 6 months' growing season (experiment ALL_TR). Temperature correlation and skill remain at the same high level (Figs. 3e and 6e), but precipitation correlations improve everywhere, particularly over North America (Fig. 4e). Precipitation RMSESS values become positive in most regions, too (Fig. 7e). The only exceptions are the Himalaya region and most of the northeast of Russia.

So far, we have not excluded any proxies from the data
assimilation. We trust that proxies with no or a weak climate
signal simply have regression coefficients close to zero and large
residuals. This way they hardly affect the analysis. However, in
a regression model with 12 independent variables and only 70 years
of overlapping data, some records may just by chance get more
weight than they deserve. Therefore, our next step is the
introduction of a weak screening. In a first step, we only
assimilate proxies with

This result appears to be good, but this could also be a result of overfitting the regression model because any combination of growing season months was allowed to affect tree growth. It would not make physiological sense if a tree were limited, for instance, by May, July and September temperatures but not by June and August temperatures. Hence, the next step is to further constrain the model. The tree growth should be affected by climate conditions in a locally varying growing season of consecutive months. We fit all possible combinations of temperature and precipitation influences in consecutive months and choose the model with the lowest AIC (note that additionally duplicates are removed; experiment ALL_TR_scr0.05_AIC_NOdup). As a result of this more physically based growth model, reconstruction skill decreases slightly in some regions with a high number of paleodata such as in parts of China and parts of North America (Figs. 6g and 7g). Because we only identify a few duplicate records, this suggests that the previously noted improvement in RMSESS was indeed partly due to overfitting. Nevertheless, temperature and precipitation correlations remain on the same high level everywhere (Figs. 3g, 4g). Sea-level correlation changes are still small and negative in China and on the west coast of North America (Fig. 5g).

Precipitation RMSESS skill score, where greens colors indicate an improvement of the analysis.

Recently, Valler et al. (2019) were able to show that major improvements of the method used in this study can be achieved by using a background error covariance matrix, which is not only calculated from the 30 ensemble members for the current year (Franke et al. 2017a) but blended with a climatological error covariance matrix based on random years and ensemble members from the original model simulations (see the “Data and methods” section; experiment ALL_TR_scr0.05_AIC_NOdup_ClimCovar). Using improved covariance information increases RMSESS values again, and a much smaller number of grid boxes with negative skill remains. Moreover, the largest effects of the better error covariance estimation appear in variables that have not been assimilated such as sea-level pressure (Fig. 5h). This is very important because one of the reasons for using data assimilation instead of traditional statistical reconstruction techniques is the possibility of gaining knowledge about further variables in a physically consistent way, which allows for a better dynamic interpretation of the identified climatic variations.

Correlations of the reconstructions with temperature improved as would be expected after the assimilation of the three data sets and using a temperature PSM. We calculate the regression coefficients based on instrumental temperature. Hence, all proxies that correlate in some way with instrumental temperature will be used to update the analysis temperature. The analysis has the highest correlation improvements with instrumental temperature if the proxies themselves have the highest correlations, which is the case for the NTREND data set with the best temperature proxies only. Correlation improvements are lower but cover a larger area with the B14 collection.

Note that correlation improvements can be a result of a negative relationship between tree-ring width and instrumental temperature if local growth is moisture-limited and growing season temperature and precipitation are negatively correlated. This can be a benefit because through the covariance we use the extra information that dry summers are also warm and vice versa. Hence, we find much better precipitation correlation with the B14 collection than with the NTREND data set. However, using moisture-sensitive trees to update temperature fields may cause problems. Precipitation variability shows high interannual variability in many locations but neither the same interdecadal to multidecadal variability as temperature nor its centennial trend (Hartmann et al., 2013; Landrum et al., 2013). Although not an issue addressed in this work, another study suggests that including the unscreened B14 records and modeling them using a similar approach as presented herein (including both temperature and moisture influences) can lead to problems in the representation of longer than interannual scales in temperature reconstructions (Tardif et al., 2019).

The regression model is calibrated on the interannual timescale assuming that TRW limitations are time-independent. However, this may not be the case (Babst et al. 2019), and therefore decadal-to-multidecadal variability may be less well represented. A similar argument holds for the update introduced by the model covariance matrix, which, although state-dependent, may yield optimal estimates only for seasonal and not decadal timescales. However, our approach avoids these pitfalls in two ways. First, at multidecadal and longer timescales, the model takes over, and therefore relations in our reconstructions are not constrained to be stationary across timescales. Furthermore, with our approach, the stationarity assumption is restricted to the regression model; thus it is a local stationarity – no further stationarity assumption concerning spatial variability is introduced except for experiment ALL_TP_scr0.05_AIC_NOdup_ClimCovar, where 50 % of the background error covariance matrix is climatological and thus stationary. Most other approaches assume stationary spatial covariances.

Theoretically, it would be optimal to assimilate all available data and let each record be weighted based on its error. However, the true observation error is unknown and its estimation is uncertain. In our case, we use a multiple regression proxy system model with 6 or 12 variables (6 months of temperature and optionally 6 months of precipitation) in a 70-year period of overlapping instrumental data and proxy measurements to estimate regression coefficients. This rather short period and large number of independent variables can lead to overfitting the model and thus underestimating the observation error, which is defined by the regression residuals. Together with the low signal-to-noise ratio of many tree-ring chronologies, this can lead to an over- or under-correction of the model field in the assimilation step. An additional experiment with doubled observation error (not shown) increases RMSESS values clearly. This suggests that PSM overfitting and consequently regression residuals that are too small are part of the reason for the negative RMSESS skill scores in the B14_T experiment in contrast to the NTREND_T experiment (Fig. 4a and c).

In the following experiments (ALL_TR_scr0.05,
ALL_TR_scr0.05_AIC_NOdup,
ALL_TR_scr0.05_AIC_NOdup_ClimCovar) we tried to
reduce the consequences of uncertain error estimates step by step.
Excluding proxies without a significant climate signal (

In the experiment, in which we only allow for a single growing season (ALL_TR_scr0.05_AIC_NOdup) per year instead of a statistically optimal selection of months and remove duplicate records that are in more than one of the data collections, correlations improve slightly but RMSESS decreases slightly. Obviously, we continue with this more realistic setup, but note that the choice of what is “best” depends on the chosen statistic or the reconstruction characteristics that are desired by the user. For instance, correlation just measures covariance, whereas RMSESS is based on squared errors and hence penalizes especially large biases; i.e., it favors an underestimation of variability over an overestimation.

Finally, we introduce an improved background error covariance estimation scheme (ALL_TR_scr0.05_AIC_NOdup_ClimCovar, Valler et al. 2019). Because assimilated information is spread in space and in between variables through the covariance matrix, it is important to estimate covariances well. Estimating covariance from both the 30 members at the current time step and from climatology and then blending both kinds of information especially improves our results for variables which have not been assimilated such as sea-level pressure (Fig. 5h).

In reality, climate signals in tree-ring proxies may be even more complicated than a function of moisture availability and growing season temperature. Limiting factors may change over time (Babst et al., 2019) or light availability may be important may and not always be highly correlated with temperature; i.e., more diffuse light after volcanic eruptions may stimulate growth (Stine and Huybers, 2014). More sophisticated proxy system forward models such as VS-Lite (Tolwinski-Ward et al., 2011) could be used in data assimilation (Acevedo et al., 2016; Dee et al., 2016). In fact, we have applied VS-lite to all TRW records in B14 (Breitenmoser et al., 2014). Although these models are more realistic and represent for instance non-linear responses, they introduce new problems mainly related to model biases. This currently prevents them from being used more broadly (Dee et al., 2016).

Finally, we tested the order of assimilated data because we assimilate data serially. In combination with using covariance localization, the order could influence the final reconstruction (Greybush et al., 2011). Assimilating the data from the best to worst record in terms of regression residuals and in opposite order from worst to best hardly influenced correlation and RMSESS skill scores (not shown). Hence, we continue to assimilate records starting with the best ones, similar to traditional reanalysis, which sorts observations from the largest to smallest expected variance reduction in the reanalysis (Slivinski et al., 2019; Whitaker et al., 2008).

Although our results are specifically valid only for the data assimilation method used herein, it is likely that methods with a similar structure, i.e., PSMs and variations of Kalman filters, will have similar sensitivities to the selection of input data (Tardif et al., 2019). We expect that they are even valid for most climate field reconstruction techniques because the basic principles of transferring proxy information to climatic variables and dealing with errors share common concepts across these methods. Even though all such methods include some routines to separate climatic information from non-climatic noise, in practice results can almost always be improved by pre-selecting the records with the highest information content, independent of the reconstruction technique applied (Neukom et al., 2019a, b; Smerdon and Pollack, 2016 and references therein). This suggests that our results are qualitatively transferrable to climate field reconstruction methods in general.

In this study, we use existing proxy data collections to generate climate field reconstructions, as is common practice. We are aware that this is not, in all cases, the main aim for which these data collections were compiled. Hence, we want to highlight the consequences of using the data set for field reconstructions. These results are not meant to rank any data set above another. Disadvantages of a data set in our setup are most probably the result of unintended usage.

How to choose input data for paleodata assimilation? We address this question by comparing three paleodata compilations of different sizes as well as using all data sets together in combination with various screening approaches.

Just using a large collection of proxy data (B14) does not lead to a skillful reconstruction. In contrast, just using a small expert selection of the best temperature proxies (NTREND) leads to a good high-latitude temperature reconstruction but wastes the potential of modern data assimilation techniques to reconstruct the four-dimensional multi-variate state of the atmosphere. However, simply combining all available input data and leaving the weighting completely to a statistical model does not lead to optimal results either. Rejecting records without a clear climatic signal, removing duplicates and using a physically plausible PSM altogether lead to a better reconstruction.

Hence the answer to our research question of whether it is better to
assimilate all available proxy data or just the best expert
selection has to be answered as follows: neither of the two is
optimal. We achieve the best results in terms of correlation and
RMSESS if we use a large collection of proxy records. However,
to make proper use of input data which were not screened by
experts it is crucial to

use proxy system models that properly represent the paleodata, here taking possible temperature and moisture limitations of tree growth into account.

use correct physical assumptions, in our case about tree growth, to avoid statistical overfitting.

remove input data with random, not significant climate signals.

care about overfitting (underestimation of errors)

The CCC400 model simulations and the EKF400 reanalysis upon which the sensitivity experiments were performed are available at the World Data Center for Climate at Deutsches Klimarechenzentrum (DKRZ) in Hamburg, Germany (

JF had the initial idea for this paper and performed most of the analysis and drafted the manuscript. VV contributed to the code development. SB helped to shape the manuscript and experimental design. RN contributed additional analysis and all authors provided critical feedback and contributed to the writing of the manuscript.

The authors declare that they have no conflict of interest.

We would like to thank CSCS for their support in conducting the ECHAM simulations.

This research has been supported by the Swiss National Science Foundation (grant no. 162668 RE-USE) and the EU ERC project (grant no. 787574 PALAEO-RA) and the Swiss National Science Foundation (NSF; grant PZ00P2_154802).

This paper was edited by Keely Mills and reviewed by Edward Cook and two anonymous referees.