The approach to time series reconstruction in climatology based upon cross-correlation coefficients and regression equations is mathematically incorrect because it ignores the dependence of time series upon their past. The proper method, described here for the bivariate case, requires autoregressive modeling in the time and frequency domains of the time series that contains simultaneous observations of both scalar series, with subsequent application of the model to extend the shorter series into the past. The method further develops previous efforts by a number of authors, starting with A. Douglass, who introduced some concepts of time series analysis into paleoclimatology. The method is applied to monthly data of total solar irradiance (TSI), 1979–2014, and sunspot numbers (SSN), 1749–2014, to restore the TSI data over 1749–1978. The results of the reconstruction are in statistical agreement with observations.

An important task in climatology and paleoclimatology is the reconstruction of a time series of some variable over a time interval when that variable was not measured. This task is solved by using proxy data – observations of a different variable, or variables, assumed to be closely related to the variable of interest during the time interval of interest. A typical example would be restoring the annual surface temperature over the past centuries using dendrochronology data – time series of annual tree-ring widths within the geographical area of interest. The observations over the time interval when both variables (tree rings as the proxy and temperature as the variable to be restored) have been properly measured are analyzed, and the relation between them is used to reconstruct the temperature time series into a more or less distant past, depending on the amount of tree-ring observations. Quite often, the mathematical tool applied for this purpose is linear regression analysis. If the cross-correlation coefficient between the proxy variable and the variable being reconstructed, estimated from the available simultaneous observations, is high, a regression equation is built and the missing past values of temperature are reconstructed on the basis of that equation. This is how it is done both in the simplest bivariate case (a proxy and the variable to be restored) and in the multivariate case when the variable of interest is reconstructed on the basis of a multivariate linear regression equation (e.g., Bradley, 2015; Santos et al., 2015). The variables can be transformed in some way before the reconstruction (for example, time series of principal components of expansions into empirical orthogonal functions are used instead of the original data, see Tingley et al., 2012) but the general principle remains the same: build a regression equation.

Yet, this cross-correlation/regression approach is generally not correct for analyzing multivariate time series. Their statistical properties cannot be understood, and the missing past data should not be reconstructed, through cross-correlation coefficients and regression equations alone; a more sophisticated analysis is required. The key factor that makes time series behave in a more complicated manner is their dependence upon time and, consequently, upon frequency, which does not exist in the case of random variables, for which a cross-correlation coefficient and a regression equation are exhaustive. Generally, consecutive values of a time series depend upon their past, and the relationship between the scalar components of a multivariate time series depends upon past values of all of its components. The time domain properties define the properties of the time series in the frequency domain, and their study allows one to obtain additional information about relations between the scalar components of a multivariate time series.
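The point about dependence on the past can be illustrated with a small synthetic sketch. It is not taken from the data used in this paper: the series `x` and `y`, the delay of 4 steps, and the noise level are hypothetical choices made only for illustration. Here `y` is a delayed, slightly noisy copy of a white-noise series `x`, so the two are almost perfectly related, yet the zero-lag cross-correlation coefficient is close to zero.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Sample cross-correlation corr(x[t], y[t + k]) for k = 0..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    return np.array([np.mean(x[: n - k] * y[k:]) for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
n, delay = 5000, 4                       # hypothetical length and delay
x_full = rng.standard_normal(n + delay)  # white-noise driver (assumption)
x = x_full[delay:]                       # "proxy" series
y = x_full[:n] + 0.1 * rng.standard_normal(n)  # y[t + delay] = x[t] + noise

r = cross_correlation(x, y, max_lag=10)
print("zero-lag correlation:", round(float(r[0]), 3))
print("lag of maximum correlation:", int(np.argmax(np.abs(r))))
```

A regression of `y` on simultaneous values of `x` would find almost no relation in this case, although the dependence is nearly exact once the lag is taken into account.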

The goals of this study are to show how to

analyze a multivariate time series in time and frequency domains to obtain and interpret the information necessary for reconstructing one of the time series' components into the past and

apply the results of this analysis to reconstruct past values of the time series on the basis of observations made during a relatively short and recent time interval.

Section 2 contains some historical notes. Section 3 describes the mathematical approach used in the paper, which is based upon autoregressive time- and frequency-domain analysis of multivariate (in our case, bivariate) time series. Section 4 provides an example with actual bivariate data (the data description and steps to be taken to reconstruct the time series). The methodology and results are summed up in Sect. 5, which also contains some practical recommendations.

Seemingly the first effort to reconstruct a climatic time series was
made by the founder of the science of dendrochronology, A. Douglass, who
suggested “a mathematical formula for calculating the growth of trees
when the rainfall is known” and vice versa (Douglass, 1919). His
studies of the dependence of tree-ring growth and climate upon each other
and upon sunspot numbers include important achievements such as

discovering and analyzing the dependence between time series of tree-ring growth and sunspot numbers (Douglass, 1909, p. 228; Douglass, 1928),

suggesting an extended memory (autoregressive) type of model for the time series of precipitation (Douglass, 1919, p. 68),

regarding the sunspot–tree-ring system as inertial (Douglass, 1936),

noting that the correlation coefficient may not properly reflect the dependence between time series (“The similarity between two trees curves …is only partly expressed by a correlation coefficient.” Douglass, 1936, p. 29),

studying time series in the frequency domain by using the Schuster periodogram (Douglass, 1919, pp. 86–110).

The first analyses that take into account the behavior of time series of climate and tree rings in both time and frequency domains through correlation functions, spectra, and coherence functions, and that describe the response of tree growth to climatic factors, were conducted by Fritts (1976). The concepts of a response function “to describe the tree-ring response to variation in climate” and a transfer function, “which transforms values of ring width into estimates of climate…”, were also introduced and adverse effects of filtering were noted, but no explicit time- or frequency-domain model was suggested. A frequency domain description of tree-ring and climate data through coherence function estimates was also given by Guiot (1982).

Probably, the first example of building an explicit time-domain model was presented by Guiot (1985) who used a set of “mutually exclusive” linear filters to split the entire frequency range of the data into separate frequency bands, obtained a regression equation for each band and then combined them into a single time-domain equation connecting temperature to tree-rings.

Guiot (1986) introduced the concept of parametric time domain models into paleoclimatology and used scalar ARMA models and/or regression equations to estimate the transfer function. The reconstruction quality was estimated on the basis of correlation coefficients with an “optimal” proxy data set.

More efforts were undertaken later to apply methods of time series analysis in paleoclimatology, including the use of the Kalman filter (Visser and Molenaar, 1988) as well as applications of the Bayesian approach to climate reconstruction (e.g., von Storch et al., 2004; Haslett et al., 2006; Tingley and Huybers, 2010).

Though the correlation/regression approach still seems to prevail in paleoclimatology, our approach is based upon an explicit time-domain model of the tree-rings–climate system in the form of a bivariate stochastic difference equation, together with a description of the system in the frequency domain. It should be regarded as an improvement of the methods suggested by previous authors, starting from the founder of dendrochronology, A. Douglass.

The basic difference between random variables and random functions had
been revealed almost 60

Monographs and papers on methods of multiple time series analysis, including estimation of the coherence function, started to appear in the 1960s and are well known in the theory of random processes (Bendat and Piersol, 1966), in geophysics (Robinson, 1967), and in econometrics (Granger and Hatanaka, 1964; Granger, 1969). None of those methods relies upon cross-correlation coefficients and regression equations.

Consider now how the linear regression model

In a matrix form, Eq. (

The stochastic difference Eq. (

The properties of the time series

In particular, the coherence squared function is found from the matrix
Eq. (

The coherence function

Other functions of frequency that describe relations between time series, such as coherent spectra and frequency response functions (e.g., Bendat and Piersol, 2010), will not be used in this article.
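As a hedged illustration of how a coherence function follows from a fitted bivariate autoregressive model, the sketch below evaluates the standard spectral matrix of a vector AR process, S(f) ∝ H(f) Σ H(f)ᴴ with H(f) = (I − Σₖ Aₖ e^{−i2πfk})⁻¹, and takes the squared coherence as |S₁₂|² / (S₁₁ S₂₂). The names `A` (AR coefficient matrices), `sigma` (innovation covariance matrix), and the numerical values are assumptions introduced for this example; they are not the notation or the estimates of this paper. Frequencies are in cycles per sampling interval.

```python
import numpy as np

def ar_coherence_squared(A, sigma, freqs):
    """Squared coherence of a bivariate AR(p) model.

    A     : array-like, shape (p, 2, 2); A[k-1] multiplies x[t-k]
    sigma : array-like, shape (2, 2); innovation covariance matrix
    freqs : frequencies in cycles per sampling interval (0 to 0.5)
    """
    A = np.asarray(A, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = A.shape[0]
    out = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        phi = np.eye(2, dtype=complex)
        for k in range(1, p + 1):
            phi -= A[k - 1] * np.exp(-2j * np.pi * f * k)
        H = np.linalg.inv(phi)        # frequency response of the AR operator
        S = H @ sigma @ H.conj().T    # spectral matrix up to a constant factor
        out[i] = np.abs(S[0, 1]) ** 2 / (S[0, 0].real * S[1, 1].real)
    return out

# Hypothetical AR(1) model in which component 2 drives component 1.
A_demo = [[[0.5, 0.4],
           [0.0, 0.3]]]
freqs = np.linspace(0.0, 0.5, 6)
coh = ar_coherence_squared(A_demo, np.eye(2), freqs)
print(np.round(coh, 3))
```

The constant scaling factor of the spectral matrix cancels in the ratio, so it is omitted here.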

The time domain model Eq. (

The task of fitting a proper autoregressive model to a bivariate time
series is discussed, for example, in Box et al. (2015), while some
recommendations for the case of climate data analysis can be found in
Privalsky (2015). A key point in the parametric time series analysis
is choosing a proper order

The following example with actual observations – sunspot numbers and
total solar irradiance of the Earth – demonstrates, among other
things, that the linear regression approach to reconstructing past
data is generally not correct. Specifically, it would not be proper to
reconstruct past values

Examples of TSI reconstruction on the basis of linear regressions can be found, for example, in Fröhlich (2009) or in Steinhilber (2009), but it should be stressed that we are discussing here mostly the method of reconstruction rather than which proxy should be used for it.

Consider the task of restoring past values of the total solar
irradiance (TSI)

Both processes are dominated by the 11 year cycle but also show
variability at smaller time scales. The autoregressive estimates of
the TSI and SSN spectra are shown in Fig. 2. The optimal AR orders for
the scalar time series models are

Consider first the traditional approach: using the linear regression
Eq. (

If

As both SSN and TSI present time series rather than random variables,
the values of TSI for the time interval from 1749 through 1978 should
be reconstructed by using a bivariate stochastic model Eq. (

In studies dedicated to reconstruction of climate and to
teleconnections in the Earth system, the statistical reliability of
estimated cross-correlation coefficients seems to be determined
without taking into account three important factors:

the variance of cross-correlation coefficient estimates depends upon the behavior of the entire correlation and cross-correlation functions of the time series (see Bendat and Piersol, 2010; Box et al., 2015); besides, the maximum absolute value of the cross-correlation function does not necessarily occur at zero lag between the time series (e.g., Fig. 3) and even if it does, one cannot ignore high cross-correlations at other lags;

if several cross-correlation coefficients are estimated, the probability of obtaining a spuriously high value increases with the number of estimates; this was proved long ago by none other than the founder of modern probability theory (Kolmogorov, 1933); it means, in particular, that selecting “statistically significant” predictors out of a set of possible predictors on the basis of “statistically significant” cross-correlation coefficients between the predictors and the predictand(s) may lead to spurious results;

in the “moving interval correlation analysis” (e.g., Maxwell et al., 2015), consecutive estimates of cross-correlation coefficients are strongly dependent on each other, and this increases the variance of the estimates.
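The second factor in the list above can be checked with a minimal Monte Carlo sketch. It assumes the most favorable case, mutually independent white-noise series, and uses the approximate 95 % bound 1.96/√n for a single correlation coefficient; the function name `spurious_rate` and all parameter values are hypothetical. For autocorrelated series the bound itself would be wider, so the sketch shows only the effect of the number of candidate predictors.

```python
import numpy as np

def spurious_rate(n_predictors, n_obs=100, n_trials=2000, seed=0):
    """Fraction of trials in which at least one of n_predictors independent
    white-noise series shows a 'significant' correlation with an unrelated
    white-noise predictand."""
    rng = np.random.default_rng(seed)
    threshold = 1.96 / np.sqrt(n_obs)  # approximate 95 % bound, white noise
    hits = 0
    for _ in range(n_trials):
        y = rng.standard_normal(n_obs)
        X = rng.standard_normal((n_predictors, n_obs))
        yz = (y - y.mean()) / y.std()
        Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
        r = Xz @ yz / n_obs            # correlation with each predictor
        hits += int(np.max(np.abs(r)) > threshold)
    return hits / n_trials

for m in (1, 5, 20):
    print(m, "predictors: rate", spurious_rate(m),
          "| independent-test value", round(1 - 0.95 ** m, 3))
```

With one candidate predictor the rate stays near the nominal 5 %, while with twenty unrelated candidates a "significant" coefficient appears in most trials.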

Returning to the data analysis, the optimal time domain AR
approximation for the bivariate time series

According to Eq. (

As the variances of TSI and SSN differ by several orders of magnitude,
the AR coefficients in Eq. (

The bivariate stochastic difference Eq. (

The knowledge of the stochastic difference Eq. (

The values of the coherence function between SSN and TSI, which has
been obtained from the spectral matrix corresponding to
Eqs. (

A seemingly obvious way to obtain past monthly values of TSI back to
January 1749 would be to simulate TSI in accordance with
Eq. (

Therefore, the past values of TSI should be reconstructed starting
from the earliest observation date of SSN, that is, from
January 1749. It means using the first of the Eq. (

The differences between the estimates of the mean values and between
the variance estimates for the observed (1979–2014,

As seen from Fig. 6, the agreement between the spectra of observed
(1979–2014) and restored TSI data (1749–1978) is quite
satisfactory. Note that though the cross-correlation coefficient
between SSN and the reconstructed TSI is less than 1, the coherence
between them (not shown) equals 1 at all frequencies because,
according to Eq. (

To further estimate these differences in reconstructions, consider the
results obtained for the interval from 1979 through 2014 over which
the values of TSI are known from observations. First, according to
Eq. (

Comparing the spectral density of the observed TSI with those of the
two reconstructed time series (shown in Fig. 7 for the lower frequency
band where the spectral energy is high), one can see that

the TSI spectrum obtained through regression is mostly
negatively biased with respect to the spectrum of TSI obtained
through Eq. (

this spectrum (which, according to Eq. (11), is identical to the SSN spectrum up to a multiplier) differs from the spectrum of the observed TSI.

More spectacular results would be obtained if one were to restore
the contribution of El Niño – Southern Oscillation (ENSO) to,
say, the global surface temperature (GST), or the Atlantic
Multidecadal Oscillation (AMO). In those cases, the correlation
coefficient between GST and ENSO or between AMO and ENSO would be very
close to zero (

The main goal of this study was to show that the task of
reconstructing past values of a bivariate time series on the basis of
simultaneous observations of its components during a relatively short
time interval should be treated within the framework of time series
analysis. This is done in the following manner:

build and analyze an autoregressive model of the bivariate time series in the time and frequency domains,

use the model to simulate the missing time series component into the past starting from the earliest observation of the proxy data and substituting the known proxy data at each step into the difference equation for the unknown time series,

verify that basic statistical properties of the reconstructed component do not differ much from the properties known from observations.
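The three steps above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the procedure or the estimates of this paper: the helper names `fit_var` and `reconstruct`, the least-squares fit, the AR order 1, the zero start-up values, and setting the innovation term to its expected value (zero) during reconstruction are all choices made for the example. Both series are assumed to have zero mean.

```python
import numpy as np

def fit_var(x, p):
    """Least-squares fit of x[t] = sum_k A[k-1] @ x[t-k] + a[t].
    x: zero-mean array of shape (n, 2). Returns A (p, 2, 2) and the
    innovation covariance estimate (2, 2)."""
    n = x.shape[0]
    Z = np.hstack([x[p - k : n - k] for k in range(1, p + 1)])  # lagged values
    Y = x[p:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ B
    A = B.T.reshape(2, p, 2).transpose(1, 0, 2)  # A[k-1][i, j]
    return A, resid.T @ resid / (n - p)

def reconstruct(proxy, A):
    """Run the first model equation forward from the earliest proxy value,
    substituting known proxy values; innovations are set to zero."""
    p = A.shape[0]
    x1 = np.zeros(len(proxy))  # start-up values at the mean (assumption)
    for t in range(p, len(proxy)):
        for k in range(1, p + 1):
            x1[t] += A[k - 1, 0, 0] * x1[t - k] + A[k - 1, 0, 1] * proxy[t - k]
    return x1

# Synthetic bivariate AR(1) data; component 2 plays the role of the proxy.
rng = np.random.default_rng(1)
A_true = np.array([[[0.6, 0.3], [0.0, 0.8]]])
n, n_overlap = 6000, 1000
x = np.zeros((n, 2))
for t in range(1, n):
    x[t] = A_true[0] @ x[t - 1] + rng.standard_normal(2) * np.array([0.2, 1.0])

# Step 1: fit the model on the recent interval where both series are known.
overlap = x[-n_overlap:]
A_hat, sigma_hat = fit_var(overlap - overlap.mean(axis=0), p=1)

# Step 2: reconstruct component 1 over the whole proxy record.
x1_rec = reconstruct(x[:, 1], A_hat)

# Step 3: compare basic statistics over the interval treated as unobserved.
past = slice(0, n - n_overlap)
corr = np.corrcoef(x1_rec[past], x[past, 0])[0, 1]
print("variance, true vs reconstructed:",
      round(float(x[past, 0].var()), 3), round(float(x1_rec[past].var()), 3))
print("correlation with the withheld series:", round(float(corr), 3))
```

In a real application the withheld past values are not available, so the check in step 3 would compare the reconstructed series with the statistics of the observed interval instead.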

This approach based upon time series analysis and upon previous research in paleoclimatology was applied here to the time series containing monthly values of the total solar irradiance of the Earth (TSI) measured during the interval from 1979 through 2014 and the sunspot numbers observed from 1749 through 2014 to produce an estimate of monthly TSI values from 1749 through 1978.

On the whole, the statistical properties of the reconstructed TSI data, such as its variance and spectral density, do not disagree with the respective properties of the observed TSI, and the time series approach produced better results than the regression-based reconstruction.

This approach to reconstruction is recommended for all cases when the spectra of the time series components differ from a constant (white noise) and/or from each other and when the cross-correlation function between the components contains more than just one statistically significant value.

It must also be stressed that the autoregressive model introduced here emerges as a natural extension of the linear regression equation for the case of multivariate random functions. In particular, it means that the use of moving average (MA) or mixed autoregressive–moving average (ARMA) models would be illogical in such cases.

The authors are grateful to F. Clette for providing the sunspot time series and commenting on it and to J. Guiot for his important comments and suggestions. A. Gluhovsky acknowledges support from the National Science Foundation under Grant no. AGS-1050588.