The skill of the state-of-the-art climate field reconstruction technique BARCAST (Bayesian Algorithm for Reconstructing Climate Anomalies in Space and Time) to reconstruct temperature with pronounced long-range memory (LRM) characteristics is tested. A novel technique for generating fields of target data has been developed and is used to provide ensembles of LRM stochastic processes with a prescribed spatial covariance structure. Based on different parameter setups, hypothesis testing in the spectral domain is used to investigate if the field and spatial mean reconstructions are consistent with either the fractional Gaussian noise (fGn) process null hypothesis used for generating the target data, or the autoregressive model of order 1 (AR(1)) process null hypothesis which is the assumed temporal evolution model for the reconstruction technique. The study reveals that the resulting field and spatial mean reconstructions are consistent with the fGn process hypothesis for some of the tested parameter configurations, while others are in better agreement with the AR(1) model. There are local differences in reconstruction skill and reconstructed scaling characteristics between individual grid cells, and the agreement with the fGn model is generally better for the spatial mean reconstruction than at individual locations. Our results demonstrate that the use of target data with a different spatiotemporal covariance structure than the BARCAST model assumption can lead to a potentially biased climate field reconstruction (CFR) and associated confidence intervals.

Proxy-based climate reconstructions are major tools in understanding the past climate system and
predicting its future variability. Target regions, spatial
density and temporal coverage of the proxy network vary between the studies,
with a general trend towards more comprehensive networks and sophisticated
reconstruction techniques used. For example,

The concept of pseudo-proxy experiments was introduced after millennium-long
paleoclimate simulations from general circulation models (GCMs) first became available, and has been
developed and applied over the last 2 decades

Available pseudo-proxy studies have to a large extent used target data from
the same GCM model simulations, subsets of the same spatially distributed
proxy network and a temporally invariant pseudo-proxy network

Additionally, we test the reconstruction skill on an ensemble member basis
using standard metrics including the correlation coefficient and the
root-mean-squared error (RMSE). The continuous ranked probability score
(CRPS) is also employed; this is a skill metric composed of two subcomponents
recently introduced for ensemble-based reconstructions

Temporal dependence in a stochastic process over time

For the instrumental time period, studies have shown that detrended local and
spatially averaged surface temperature data exhibit LRM properties on timescales from months up to decades

Our basic assumption is that the background temporal evolution of Earth's
surface air temperature can be modeled by the persistent Gaussian stochastic
model known as the fractional Gaussian noise (fGn; Chapter 1 and 2 in

The fGn model is appropriate for many observations of SAT data, but there are
also some deviations. The theoretical fGn follows a Gaussian distribution, but
for instrumental SAT data the deviation from Gaussianity varies with latitude

Since the target data are represented as an ensemble of independent members generated from the same stochastic process, there is little value in estimating and analyzing ensemble means from the target and reconstructed time series themselves. Anomalies across the ensemble members will average out, and the ensemble mean will simply be a time series with non-representative variability across scales. Instead we will focus on averages in the spectral sense. The median of the ensemble member-based metrics is used to quantify the reconstruction skill.
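The difference between the two kinds of averaging can be illustrated with a minimal sketch. It is not the code used in this study: the helper `periodogram`, the ensemble size and the use of white noise as stand-in ensemble members are assumptions chosen for illustration. The spectrum of the ensemble-mean series loses power roughly in proportion to the number of members, whereas the mean of the member spectra retains the variability of a single realization.

```python
import numpy as np

def periodogram(x):
    """Periodogram of a 1-D series at the positive Fourier frequencies (unit time step)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spec = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
    freq = np.fft.rfftfreq(n, d=1.0)
    return freq[1:], spec[1:]  # drop the zero frequency

rng = np.random.default_rng(0)
n_members, n_time = 20, 1000
# White noise stands in for the independent ensemble members (illustration only).
ensemble = rng.standard_normal((n_members, n_time))

# Spectrum of the ensemble-mean series: independent anomalies average out.
_, spec_of_mean = periodogram(ensemble.mean(axis=0))

# Average in the spectral sense: mean of the individual member periodograms.
mean_spec = np.mean([periodogram(member)[1] for member in ensemble], axis=0)

# Ratio of average power; expected to be close to 1 / n_members.
power_ratio = spec_of_mean.mean() / mean_spec.mean()
```

For independent members the ratio is close to 1/20 in this setup, which is why the ensemble-mean series is not representative of the variability of any single member.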

The reconstruction method to be tested, Bayesian Algorithm for
Reconstructing Climate Anomalies in Space and Time (BARCAST), is based on a
Bayesian hierarchical model

A particular advantage of BARCAST as a probabilistic reconstruction technique
lies in its capability to provide an objective error estimate as the result
of generating a distribution of solutions for each set of initial conditions.
The reconstruction skill of the method has been tested earlier and compared
against a few other climate field reconstruction
(CFR) techniques using pseudo-proxy experiments.

In the following, Sect. 2 describes the methodology of BARCAST and the target data generation, and introduces the spectral estimator used for persistence analyses. Section 3 gives an overview of the experiment setup and explains the hypothesis testing procedure. Section 4 presents the results of the hypothesis testing, performed in the spectral domain, of the persistence properties of the local and spatial mean reconstructions, and summarizes the skill metric results. Finally, Sect. 5 discusses the implications of our results and provides concluding remarks.

BARCAST is a climate field reconstruction method, described in detail in

The spatial

On the data level, the observation equations for the instrumental and proxy data are

The remaining level is the prior. Weakly informative but proper prior
distributions are specified for the scalar parameters and the temperature
field for the first year in the analysis. The priors for all parameters
except

The Metropolis-coupled MCMC algorithm is run for 5000 iterations, running
three chains in parallel. Each chain is assumed equally representative for
the temperature reconstruction if the parameters converge. There are a number
of ways to investigate convergence, for instance one can study the
variability in the plots of draws of the model parameters as a function of
step number of the sampler, as in

There are numerous reasons why the parameters may fail to converge, including an inadequate choice of prior distribution and/or hyperparameters, or an insufficient number of iterations in the MCMC algorithm. It may also be problematic if the spatiotemporal covariance structure of the observations or surrogate data deviates strongly from the model assumption of BARCAST.
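A widely used numerical complement to visual inspection of the chains is the potential scale reduction factor of Gelman and Rubin, which compares the variance within each chain to the variance between chains. The sketch below is an illustration only and is not taken from the BARCAST code; the function name `potential_scale_reduction` and the synthetic chains are assumptions. Values close to 1 suggest that the parallel chains sample the same distribution.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Gelman-Rubin statistic for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()            # mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(2)
# Three hypothetical chains that sample the same distribution.
mixed = rng.standard_normal((3, 500))
# Three hypothetical chains stuck around different values.
stuck = mixed + np.array([[0.0], [2.0], [4.0]])

rhat_mixed = potential_scale_reduction(mixed)
rhat_stuck = potential_scale_reduction(stuck)
```

In this synthetic case the well-mixed chains give a value near 1, while the chains centered on different values give a value well above 1.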

BARCAST was used to generate an ensemble of reconstructions, from which a mean reconstruction and its uncertainties are obtained. In our case, the draws for each temperature field and parameter are thinned so that only every 10th of the 5000 iterations is saved; this helps secure the independence of the draws.
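The effect of thinning can be sketched as follows. This is a toy illustration, not BARCAST output: the chain is a hypothetical autocorrelated sequence standing in for the draws of one scalar parameter, and the helper `lag1_autocorrelation` is an assumption.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

n_iter, thin = 5000, 10
rng = np.random.default_rng(1)

# Hypothetical sampler output with strong dependence between successive draws.
chain = np.zeros(n_iter)
for t in range(1, n_iter):
    chain[t] = 0.8 * chain[t - 1] + rng.standard_normal()

# Keep iterations 10, 20, ..., 5000 (every 10th draw).
thinned = chain[thin - 1::thin]

acf_full = lag1_autocorrelation(chain)
acf_thinned = lag1_autocorrelation(thinned)
```

Thinning leaves 500 of the 5000 draws, and the dependence between neighboring saved draws is much weaker than between neighboring raw draws.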

The output temperature field is reconstructed also in grid cells without
observations; this is a unique property compared to other well-known field
reconstruction methods such as the regularized expectation–maximization technique
(RegEM) applied in

While generating ensembles of synthetic LRM processes in time is straightforward using statistical software packages, it is more complicated to generate a field of persistent processes with prescribed spatial covariance. Below we describe a novel technique that fulfills this goal, which can be extended to include more complicated spatial covariance structures. Such a spatiotemporal field of stochastic processes has many potential applications, both theoretical and practical.
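For reference, the purely temporal case can be sketched in a few lines. The example below is not the field-generation technique of this paper: it draws independent fGn time series by Cholesky factorization of the exact fGn autocovariance, using the relation H = (β + 1)/2 between the Hurst exponent and the spectral exponent β. The function names and parameter values are assumptions, and no spatial covariance is imposed.

```python
import numpy as np

def fgn_autocovariance(n, beta, sigma2=1.0):
    """Autocovariance of fGn at lags 0..n-1 for spectral exponent beta (0 <= beta < 1)."""
    hurst = (beta + 1.0) / 2.0
    k = np.arange(n, dtype=float)
    return 0.5 * sigma2 * (
        np.abs(k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst) + np.abs(k - 1) ** (2 * hurst)
    )

def simulate_fgn(n, beta, n_members=1, seed=None):
    """Draw n_members independent fGn series of length n (temporal dependence only)."""
    rng = np.random.default_rng(seed)
    gamma = fgn_autocovariance(n, beta)
    # Toeplitz covariance matrix built from the lag differences |i - j|.
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    chol = np.linalg.cholesky(gamma[lags])
    # Each row z @ chol.T has covariance chol @ chol.T.
    return rng.standard_normal((n_members, n)) @ chol.T

series = simulate_fgn(n=512, beta=0.75, n_members=50, seed=0)
```

The Cholesky approach is exact but scales poorly with series length; it is used here only because it makes the covariance structure explicit.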

Generation of target data begins with reformulating Eq. (

The stabilizing term

Summations over time steps

If we for convenience omit the spatial index

The temporal dependencies in the reconstructions are investigated to obtain
detailed information about how the reconstruction technique may alter the
level of variability on different scales, and how sensitive it is to the
proxy data quality. Persistence properties of target data, pseudo-proxies and
the reconstructions are compared and analyzed in the spectral domain using
the periodogram as the estimator. See Appendix

Power spectra are visualized in log–log plots since the spectral exponent
then can be estimated by a simple linear fit to the spectrum. The raw and
log-binned periodograms are plotted, and

It is also possible to use other estimators for scaling analysis, such as the
detrended fluctuation analysis (DFA;

The experiment domain configuration is selected to resemble that of the
continental landmass of Europe, with

Pseudo-instrumental data cover the entire reconstruction region for the time
period 850–1000 and are identical to the noise-free values of the true target
variables. The spatial distribution of the pseudo-proxy network is highly
idealized as illustrated in Fig.

The spatial domain of the reconstruction experiments. Dots mark locations of instrumental sites; proxy sites are highlighted by red circles. The superimposed map of Europe provides a spatial scale.

Our set of experiments is summarized in Table

Summary of the experiment setup.

Hypothesis testing in the spectral domain is used to determine which
pseudo-proxy/reconstructed data sets can be classified as fGn with the
prescribed scaling parameter, or as AR(1) with parameter

The second null hypothesis tested is that the data can be described as an
AR(1) process at all frequencies, with the parameter

Mean raw and log-binned PSD for pseudo-proxy data (blue curve and asterisks, respectively) and reconstruction at
the same site (red curve and dots, respectively) generated from

Figure

The two null hypotheses provide no restrictions on the normalization of the fGn and AR(1) data used to generate the Monte Carlo ensembles. In particular, they do not have to be standardized in the same manner as the pseudo-proxy/reconstructed data. This makes the experiments more flexible, as the confidence range of the Monte Carlo ensemble can be shifted vertically to better accommodate the data under study. A standard normalization subtracts the mean and divides by the standard deviation. This was sufficient to support the null hypotheses in many of our experiments, while a different normalization had to be used in others.
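The general logic of such a Monte Carlo test can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used for the results in this paper: the null model is an AR(1) process with a hypothetical parameter `phi`, the confidence range is taken as the pointwise 2.5 to 97.5 percentiles of the Monte Carlo mean spectra, raw rather than log-binned periodograms are used, and all function names are invented for the example.

```python
import numpy as np

def standardize(x):
    """Subtract the mean and divide by the standard deviation along the time axis."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def periodogram(x):
    """Periodogram along the last axis, zero frequency dropped."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    return (np.abs(np.fft.rfft(x, axis=-1)) ** 2 / n)[..., 1:]

def simulate_ar1(phi, n, n_members, rng, burn_in=200):
    """Simulate n_members AR(1) series of length n, discarding a burn-in period."""
    eps = rng.standard_normal((n_members, n + burn_in))
    x = np.zeros_like(eps)
    for t in range(1, n + burn_in):
        x[:, t] = phi * x[:, t - 1] + eps[:, t]
    return x[:, burn_in:]

def consistent_with_null(data_spec, null_specs, level=95.0):
    """True if data_spec lies inside the pointwise Monte Carlo range at all frequencies."""
    tail = (100.0 - level) / 2.0
    lower, upper = np.percentile(null_specs, [tail, 100.0 - tail], axis=0)
    return bool(np.all((data_spec >= lower) & (data_spec <= upper)))

rng = np.random.default_rng(3)
n_time, n_members, n_mc, phi = 500, 20, 100, 0.7

# Monte Carlo ensemble of mean spectra under the AR(1) null hypothesis.
null = standardize(simulate_ar1(phi, n_time, n_mc * n_members, rng))
null_specs = periodogram(null).reshape(n_mc, n_members, -1).mean(axis=1)

# Mean spectrum of white-noise data, which should be rejected as AR(1) with phi = 0.7.
white = standardize(rng.standard_normal((n_members, n_time)))
white_spec = periodogram(white).mean(axis=0)
white_is_ar1 = consistent_with_null(white_spec, null_specs)
```

Requiring agreement at every frequency is a strict criterion. Changing the normalization of the Monte Carlo ensemble rescales `null_specs` and thereby shifts the confidence range vertically in a log-log plot.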

BARCAST successfully estimates posterior distributions for all reconstructed
temperature fields and scalar parameters. Convergence is reached for the
scalar parameters despite the inconsistency of the input data temporal
covariance structure with the default assumption of BARCAST. Table

Further results concern the spectral analyses and skill metrics. All references to spectra in the following correspond to mean spectra. Analyses of the reconstruction skill presented below are performed on a grid point basis, and for the correlation and RMSE also for the spatial mean reconstruction. While the latter provides an aggregate summary of the method's ability to reproduce specified properties of the climate process on a global scale, the former evaluates BARCAST's spatial performance.
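As a minimal sketch of how such metrics can be evaluated per ensemble member and then summarized by the median, consider the example below. The array layout (member, time, grid cell), the synthetic data and the unweighted spatial mean are assumptions made for illustration; they do not reproduce the experiments of this study.

```python
import numpy as np

def correlation(a, b, axis):
    """Pearson correlation between a and b along the given axis."""
    a = a - a.mean(axis=axis, keepdims=True)
    b = b - b.mean(axis=axis, keepdims=True)
    return (a * b).sum(axis=axis) / np.sqrt((a ** 2).sum(axis=axis) * (b ** 2).sum(axis=axis))

def rmse(a, b, axis):
    """Root-mean-squared error between a and b along the given axis."""
    return np.sqrt(np.mean((a - b) ** 2, axis=axis))

rng = np.random.default_rng(4)
n_members, n_time, n_grid = 20, 200, 25
# Synthetic target and a noisy stand-in for its reconstruction.
target = rng.standard_normal((n_members, n_time, n_grid))
recon = target + 0.5 * rng.standard_normal(target.shape)

# Grid point basis: metric over time for each member, then median over members.
local_corr = np.median(correlation(recon, target, axis=1), axis=0)
local_rmse = np.median(rmse(recon, target, axis=1), axis=0)

# Spatial mean (equal weights here, a simplification of area weighting).
recon_mean, target_mean = recon.mean(axis=2), target.mean(axis=2)
mean_corr = np.median(correlation(recon_mean, target_mean, axis=1))
mean_rmse = np.median(rmse(recon_mean, target_mean, axis=1))
```

The local results are arrays with one value per grid cell, while the spatial mean results are single numbers summarizing the whole domain.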

The scaling properties of the input data have already been modified when the target
data are perturbed with white noise to generate pseudo-proxies. The power
spectra shown in blue in Fig.

Hypothesis testing was performed in the spectral domain for the field
reconstructions, with the two null hypotheses formulated as follows:

The reconstruction is consistent with the fGn structure in the target data for all frequencies.

The reconstruction is consistent with the AR(1) model used in BARCAST for all frequencies.

Table

Hypothesis testing results for local reconstructed data compared to Monte Carlo ensembles of fGn and AR(1) processes. The “x” mark in the table indicates that the null hypothesis cannot be rejected. Null hypotheses are 1: the reconstruction is consistent with the fGn structure in the target data for all frequencies; 2: the reconstruction is consistent with the AR(1) assumption from BARCAST for all frequencies.

Mean raw and log-binned PSD for local reconstructed data at a site between proxies (gray curve and dots,
respectively) generated from

The hypothesis testing results vary moderately between the individual grid
cells. PSD analyses of the local reconstructions using the same

Mean raw and log-binned PSD for the spatial mean reconstruction (gray curve and dots, respectively), generated
from

The spatial mean reconstruction is calculated as the mean of the local
reconstructions for all grid cells considered, weighted by the areas of the
grid cells. The reconstruction region considered is
37.5–67.5

Hypothesis testing results for spatial mean reconstructed data compared to Monte Carlo ensembles of fGn
and AR(1) processes. The null hypotheses 1 and 2 are the same as in Table

The power spectra can also be used to gain information about the fraction of
variance lost/gained in the reconstruction compared with the target. This
fraction is in some sense the bias of the variance, and was found by
integrating the spectra of the input and output data over frequency. The
spatial mean target/reconstructions were used, and the mean log-binned
spectra. The total power in the spatial mean reconstruction and the target
were estimated, and the ratio of the two provides the under/overestimation of
the variance:
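As a hedged numerical illustration (not necessarily the authors' exact expression), the total power can be approximated by trapezoidal integration of a spectrum over a chosen frequency range, and the ratio of reconstruction to target power then measures the fraction of variance retained. The function name `total_power`, the frequency limits and the synthetic power-law spectra below are assumptions.

```python
import numpy as np

def total_power(freq, spec, f_min=None, f_max=None):
    """Integrate a spectrum over [f_min, f_max] with the trapezoidal rule."""
    freq = np.asarray(freq, dtype=float)
    spec = np.asarray(spec, dtype=float)
    mask = np.ones(freq.shape, dtype=bool)
    if f_min is not None:
        mask &= freq >= f_min
    if f_max is not None:
        mask &= freq <= f_max
    f, s = freq[mask], spec[mask]
    # Works for unevenly spaced frequencies, as in log-binned spectra.
    return float(np.sum(0.5 * (s[1:] + s[:-1]) * np.diff(f)))

# Synthetic log-spaced spectra: the "reconstruction" has 80 % of the target power.
freq = np.logspace(-3, np.log10(0.5), 30)
target_spec = freq ** -0.75
recon_spec = 0.8 * target_spec

variance_ratio = total_power(freq, recon_spec) / total_power(freq, target_spec)
low_freq_ratio = total_power(freq, recon_spec, f_max=0.1) / total_power(freq, target_spec, f_max=0.1)
```

A ratio below 1 indicates that the reconstruction underestimates the variance in the chosen frequency range, and a ratio above 1 indicates overestimation.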

Log–log plot showing log-binned power spectra of spatial mean target (blue) and reconstruction (red)
for one experiment. Vertical gray lines mark the frequency ranges used to estimate bias of variance
as referred to in Sect.

Local correlation coefficient between reconstructed temperature field and target field for the
verification period (ensemble mean). The box plots left of the color bars indicate the distribution of grid
point correlation coefficients.

Local RMSE between reconstructed temperature field and target field for
the verification period. The box plots left of the color bars indicate the distribution of grid point
RMSE values.

Local

It is common practice in paleoclimatology to evaluate reconstruction skill
using metrics such as Pearson's correlation coefficient

Figures

Figure

Table

Table

In this study we have tested the capability of BARCAST to preserve temporal LRM properties of reconstructed data. Pseudo-proxy and pseudo-instrumental data were generated with a prescribed spatial covariance structure and LRM temporal persistence using a new method. The data were then used as input to the BARCAST reconstruction algorithm, which by design uses an AR(1) model for temporal dependencies in the input/output data. The spatiotemporal availability of observational data was kept the same for all experiments in order to isolate the effect of the added noise level and the strength of persistence in the target data. The mean spectra of the reconstructions were tested against the null hypotheses that the reconstructed data can be represented as LRM processes using the parameters specified for the target data, or as AR(1) processes using the parameter estimated from BARCAST.

We found that despite the default assumptions in BARCAST, not all local and
spatial mean reconstructions were consistent with the AR(1) model. Figures

All spatial mean reconstructions are consistent with the fGn process null hypothesis
according to Table

The power spectra in Figs.

In addition to BARCAST, other reconstruction techniques that may experience
similar deficiencies for LRM target data are the regularized
expectation–maximization algorithm (RegEM;

When an incorrect statistical model is used to reconstruct a climate signal,
the temporal correlation structure is likely to deteriorate in the
process. For the range of different reconstructions available, such effects
may contribute to discussions on a number of questions under study, including
the possible existence of different scaling regimes in paleoclimate; see

The criteria for the hypothesis testing used in this study are strict, and
may be modified if reasonable arguments are provided. For example, if the
first null hypothesis used here was modified so that only the low-frequency
components of the spectra were required to fall within the confidence ranges,
more of the reconstructions would be consistent with the fGn model. However,
from studying the spectra in Figs.

The skill metrics used to evaluate the reconstructions are Pearson's
correlation coefficient

The spectral shape of the input pseudo-proxy data plotted in blue in Fig.

Median local skill measurements

Our study further suggests that for a proxy network of high quality and density, exhibiting LRM properties, the BARCAST methodology
is capable, without modification, of producing skillful reconstructions with LRM preserved across the region. This is because the
data information overwhelms the vague priors. The availability of well-documented proxy records therefore helps the analyst select
an appropriate reconstruction method based on the input data. For quantification and assessment of real-world proxy quality,
forward proxy modeling is a powerful tool that models proxy growth/deposition instead of the target variable evolution, also
taking known proxy uncertainties and biases into consideration. See for example

Several extensions to the presented work appear relevant for future studies,
including (a) implementing external forcing and responses to this forcing
in the target data to make the numerical experiments more realistic, (b) generating target data using a more complex model than described in
Sect.

The alternatives (a) and (b) can be implemented together. Relevant
advancements for target data generation can be obtained using the class of
stochastic–diffusive models, such as the models described in

For the BARCAST CFR methodology, reformulation of model Eqs. (1)–(2) would
drastically improve the performance in our experiments. However, at present
we cannot guarantee that modifications favoring LRM are practically feasible
in the context of a Bayesian hierarchical model, due to higher computational
demands. Changing the AR(1) model assumption to instead account for LRM would
in the best scenario slow the algorithm down substantially, and in the worst
scenario it would not converge at all. Some cut-off timescale would have to
be chosen to ensure convergence. Regarding the spatial covariance structure,
accounting for teleconnections introduces similar computational challenges.
The more general Matérn covariance family form

The pseudo-proxy study presented here sets a powerful example for how to construct and utilize an experimental structure to isolate specific properties of paleoclimate reconstruction techniques. The generation of the input data requires far less computational power and time than GCM paleoclimatic simulations, but also results in less realistic target temperature fields. We demonstrate that there are many areas of use for these types of data, including statistical modeling and hypothesis testing.

Data and codes are available on request, including the BARCAST code package.

The periodogram is defined here in terms of the discrete Fourier transform

For probabilistic forecasts, scoring rules are used to measure the forecast
accuracy, and proper scoring rules ensure that the maximum reward is given
when the true probability distribution is reported. In contrast, the
reduction of error (RE) and coefficient of efficiency (CE) are improper
scoring rules, meaning they measure the accuracy of a forecast, but the
maximum score is not necessarily given if the true probability distribution
is reported. For climate reconstructions, RE

The concept behind the CRPS is to provide a metric of the distance between
the predicted (forecasted) and occurred (observed) cumulative distribution
functions of the variable of interest. The lowest possible value for the
metric corresponding to a perfect forecast is therefore CRPS

In the case of a reconstruction ensemble at each spatial location

For the average over

It can be demonstrated that the spatially and/or temporally averaged CRPS can
further be broken into two parts: the average reliability score metric
(

Equation (

Both scores are given in the same unit as the variable under study, here surface temperature.
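For a single observation and a finite ensemble, the CRPS of the empirical ensemble distribution can be computed directly. The sketch below uses the standard identity that expresses the CRPS as the mean absolute error of the members minus half the mean absolute difference between member pairs; it does not implement the decomposition into the two sub-scores discussed above, and the function name `crps_ensemble` is an assumption.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of the empirical distribution of `members` for a scalar observation `obs`."""
    members = np.asarray(members, dtype=float)
    # Mean absolute distance between the ensemble members and the observation.
    accuracy = np.mean(np.abs(members - obs))
    # Half the mean absolute distance between all pairs of members (ensemble spread).
    spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return float(accuracy - spread)

perfect = crps_ensemble([1.0, 1.0, 1.0], 1.0)   # all members equal the observation
two_member = crps_ensemble([0.0, 2.0], 1.0)     # observation between two members
single = crps_ensemble([3.0], 1.0)              # reduces to the absolute error
```

The score is zero when every member equals the observation, and for a one-member ensemble it reduces to the absolute error, which makes the unit of the score (that of the variable itself) explicit.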

The forms of the prior PDFs for the scalar parameters in BARCAST are
identical to those used in

The parameter values prescribed for the target data are listed in Table

The mean of the posterior distributions of the BARCAST parameters

List of parameters defined in BARCAST, form of prior and hyperparameters.

List of parameter values defined for the target data set. The four values of

Mean of posterior distribution for each parameter.

The authors declare that they have no conflict of interest.

Tine Nilsen was supported by the Norwegian Research Council (KLIMAFORSK programme) under grant no. 229754, and partly by Tromsø Research Foundation via the UiT project A31054. Dmitry V. Divine was partly supported by Tromsø Research Foundation via the UiT project A33020. Johannes P. Werner gratefully acknowledges support from the Centre for Climate Dynamics (SKD) at the Bjerknes Centre. Dmitry V. Divine, Tine Nilsen and Johannes P. Werner also acknowledge the IS-DAAD project 255778 HOLCLIM for providing travel support. The authors would like to thank Kristoffer Rypdal for helpful discussions and comments.

Edited by: Jürg Luterbacher
Reviewed by: three anonymous referees