A statistical framework for evaluating climate model simulations by comparison with climate observations from instrumental and proxy data (Part 1 in this series) is improved by relaxing two assumptions. This allows autocorrelation in the statistical model for simulated internal climate variability and enables direct comparison of two alternative forced simulations to test whether one fits the observations significantly better than the other. The extended framework is applied to a set of simulations driven with forcings for the pre-industrial period 1000–1849 CE and to 15 tree-ring-based temperature proxy series. Simulations run with only one external forcing (land use, volcanic, small-amplitude solar, or large-amplitude solar) do not significantly capture the variability in the tree-ring data, although the simulation with volcanic forcing does so for some experiment settings. When all forcings are combined (using either the small- or large-amplitude solar forcing), also including orbital, greenhouse-gas and non-volcanic aerosol forcing, and additionally used to produce small simulation ensembles starting from slightly different initial ocean conditions, the resulting simulations clearly capture a significant part of the observed variability. Nevertheless, for some choices in the experiment design, they are not significantly closer to the observations than unforced simulations, because results vary strongly between regions. It is also not possible to tell whether the small-amplitude or the large-amplitude solar forcing brings the multiple-forcing simulations closer to the reconstructed temperature variability. Proxy data from more regions and of more types, or representing larger regions and complementary seasons, are apparently needed for more conclusive model–data comparisons in the last millennium.

While much of our knowledge about climate changes in the past
emerges from evidence in various natural archives

Based on theoretical considerations and some assumptions,

Here, we contribute further to the SUN12 work by discussing practical
considerations arising when using real proxy data series that
represent different seasons and regions of different size, having
different lengths and statistical precision. To this end, we select
a set of 15 tree-ring-based temperature reconstructions, spread across
North America, Eurasia and Oceania, which we use together with the
same set of global climate model simulations

This work also serves as a companion study to the hemispheric-scale
analysis by

To obtain a statistical methodology for ranking a set of plausible
alternative forced simulations, and for identifying forced simulations
fitting the observed temperature variations significantly better than an
unforced model, SUN12 proposed a type of regional (or local) statistical
model relating a climate model simulation time series (

Section 4 of SUN12 demonstrated that, for unbiased ranking, the calibration of
proxy data should aim at achieving the right scaling factor of the true
temperature (

Based on their statistical models, SUN12 developed two test statistics for comparing climate model simulation data with observations. First, before any attempt is made to rank alternative model simulations, testing should be done to establish whether a statistically significant positive correlation can be seen between a simulation series and the observations, because otherwise there is no evidence that the simulations and the true temperature share any effect of the forcing under study.
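As an illustration only, the logic of such a pre-test can be sketched as follows (a simplified one-sided correlation test standing in for the actual SUN12 regression-type statistic; the function name and the normal approximation are our own choices):

```python
import numpy as np
from math import erf, sqrt

def correlation_pretest(sim, obs, alpha=0.05):
    """One-sided test for positive correlation between a simulation
    series and an observation series.  Illustrative sketch only: the
    SUN12 statistic is a weighted regression-type quantity, and this
    plain correlation test also assumes independent pairs."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(sim)
    r = float(np.corrcoef(sim, obs)[0, 1])
    # t-type statistic for H0: rho = 0
    t = r * sqrt((n - 2) / (1.0 - r * r))
    # one-sided p value from a normal approximation (adequate for large n)
    p = 1.0 - 0.5 * (1.0 + erf(t / sqrt(2.0)))
    return r, p < alpha
```

Only when such a pre-test succeeds does a ranking of simulations by their distance to the observations become meaningful.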

The correlation pre-test was formulated in SUN12 (Sect. 8) in terms of
a regression type statistic, denoted

A
principle for the choice of weights denoted

Assuming next that a correlation has been established, a distance
between a simulation sequence and an observation sequence is formed
as a weighted mean squared distance,
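In its simplest form, such a distance can be sketched as follows (an illustrative implementation; SUN12 additionally normalize the distance and derive its standard error):

```python
import numpy as np

def weighted_msd(sim, obs, weights=None):
    """Weighted mean squared distance between a simulation sequence and
    an observation sequence.  With equal weights this reduces to the
    ordinary mean squared distance."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    w = np.ones_like(sim) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * (sim - obs) ** 2) / np.sum(w))
```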

When an ensemble of simulations driven by the same forcing (but
differing in their initial conditions) is available, they should all
be used in an averaging process. This can be done in two different
ways. Either a
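The two averaging strategies can be sketched as follows (hypothetical function names; whether the simulations or the distances are averaged first is exactly the choice at hand):

```python
import numpy as np

def msd(sim, obs):
    """Plain mean squared distance between two series."""
    return float(np.mean((np.asarray(sim, float) - np.asarray(obs, float)) ** 2))

def distance_of_ensemble_mean(members, obs):
    """First average the ensemble members, then measure the distance."""
    return msd(np.mean(np.asarray(members, dtype=float), axis=0), obs)

def mean_of_member_distances(members, obs):
    """First measure each member's distance, then average the distances."""
    return float(np.mean([msd(m, obs) for m in members]))
```

By convexity, the mean of the member distances is never smaller than the distance of the ensemble mean; the difference reflects the internal variability of the members around their common forced signal.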

For comparison of different forced models, SUN12 used a normalized
version of

In Sect. 6 of SUN12, a formula is derived for the standard
error of

The test statistics for correlation and distance were derived under specific
assumptions on the climate model simulations (whereas the true climate was
arbitrary). For the purpose of model ranking only, this need not be
considered a problem, but in their role as test statistics we want them to be
robust against imperfections in the statistical model assumptions. In
particular, we want to relax the following assumptions from SUN12:

assumed lack of autocorrelation in the reference model simulations, i.e. these are statistically represented by white noise;

truly unforced reference model, so in particular no joint time-varying forcing in

Concerning the first assumption, it is well known that internal
temperature variation can show autocorrelation, because the climate
system acts as an integrator of the short-term weather variations

The second assumption must be relaxed in order to study the influence of two
or more forcings added sequentially to a climate model, or to compare
simulations with forcings of a similar type to see whether one fits
significantly better than the other. The sequential inclusion of forcings has been
implemented, for example, by

For the question of how data from several regions and/or seasons (represented by index

The denominator is the standard error of the numerator, and the
test statistic
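Such a ratio, combining regional statistics into a single overall test value, might look as follows (a sketch under the simplifying assumption of independent regions; the weights and standard errors would come from the framework itself):

```python
import numpy as np

def combined_statistic(values, std_errs, weights):
    """Weighted sum of regional statistics divided by the standard error
    of that sum.  Regions are treated as independent here, which is a
    simplification: spatially close regions can be correlated."""
    v = np.asarray(values, dtype=float)
    s = np.asarray(std_errs, dtype=float)
    w = np.asarray(weights, dtype=float)
    numerator = np.sum(w * v)
    denominator = np.sqrt(np.sum(w ** 2 * s ** 2))
    return float(numerator / denominator)
```

The statistic is invariant to a common rescaling of the weights, so only their relative sizes matter.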

If the forcing used in a simulation experiment is realistic, and if, additionally, its simulated climate response is realistic, we expect to see negative observed

Before the statistical framework can be applied, the time resolution
(time unit) to use for the model–data comparison must be decided upon. For
reasonably correct test

We follow

Instrumental temperature data are needed for two purposes. First,
SUN12 argued for using best possible data to maximize the
statistical precision of the model–data comparison. Thus, in most
cases, instrumental data should be used rather than proxy data within
time periods when both exist. Second, instrumental data are needed to
calibrate the proxy data. There are, however, several alternative
temperature data sets to choose between

For the current study, we instead selected the GISS1200 gridded global
temperature data set

Tree-ring data are available from many parts of the globe. They can be
sensitive to climate in different seasons but always have annual
resolution and often explain a substantial fraction of observed
temperature or precipitation variations

Twelve of the 15 tree-ring records have been developed using the
regional curve standardization (RCS) technique

One additional comment should be made in the context of the SUN12
framework. The number of trees used in a tree-ring chronology will
most often vary through time; typically there are fewer trees in the
earliest part of a chronology, but the sample size can vary very
irregularly with time. These variations in sample size are known to
cause temporal variations in the variance of
a chronology.
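A standard way to remove such sample-size-driven variance changes (not necessarily the one used by the original investigators) is to rescale the chronology by the expected standard deviation of a mean of n(t) inter-correlated series:

```python
import numpy as np

def stabilize_variance(chronology, n_trees, rbar):
    """Rescale a mean chronology so that its expected variance does not
    depend on the number of trees n(t).  For n equally inter-correlated
    series with mean inter-series correlation rbar, the variance of
    their mean is proportional to (1 + (n - 1) * rbar) / n; dividing by
    the square root of this factor removes the dependence on n.
    Illustrative sketch of a common variance adjustment."""
    x = np.asarray(chronology, dtype=float)
    n = np.asarray(n_trees, dtype=float)
    factor = np.sqrt((1.0 + (n - 1.0) * rbar) / n)
    return x / factor
```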

Tree-ring temperature reconstructions used in this study, with seasonal representation as determined by the respective investigators. TRW – tree-ring width; MXD – maximum density; IND – individual standardization; RCS – regional curve standardization; SF – signal-free standardization. Short names and start year used in this study are also given.

Data sources for the tree-ring records.

Selected information for each region: latitude/longitude boundaries,
proxy data calibration period, and correlation

The first decision is to select the season that each proxy record will
represent in the model–data comparison. As each original author team has
generally spent considerable efforts on determining the most appropriate
season for each record – and as the SUN12 framework admits using all
possible combinations of seasons – it seems most natural to follow the
respective original judgements (see Table

A time period (or time periods) is required for calibration of
tree-ring data and, as we argue below (Sect.

Defining the region that each tree-ring series will represent is
a more challenging task. SUN12 stated (in their Sect. 2) that
“typically, this region consists of a single grid box, but averages
over several grid boxes can also be considered”. A single grid-box
temperature may perhaps maximize the statistical precision for
calibration of a single tree-ring chronology, but climate model errors are typically largest at the grid-box scale and decrease with increasing spatial smoothing. It has therefore been recommended that some spatial averaging is applied when climate models are evaluated

Although we cannot give any precise recommendation on how to determine the optimal spatial domain that a proxy record should represent, we assume that there is also a similar need in data assimilation and detection and attribution studies. This appears to be an issue where more research is needed. At this stage, we can at least
suggest a practically affordable way to semi-subjectively define
a reasonable region for each tree-ring record. To this end, we plot
and visually interpret the spatial field of correlations between each
tree-ring record and the appropriate seasonal mean temperatures in
GISS1200 data (Fig.

The spatial correlation analysis was undertaken for calibration
periods chosen above. Each map was then visually inspected to
determine an appropriate region. We did not attempt to define any
objective criterion, but we combined information about (i) where
correlations are strongest, (ii) where chronologies are located and
(iii) information from the literature regarding which regions the data
represent. For example, the TASM area is allowed to extend over much
of the ocean surrounding Tasmania, because
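A minimal sketch of such a correlation map on first-differenced data follows (array shapes and names are our own; missing data, grid weighting and the choice of calibration period are not handled):

```python
import numpy as np

def first_diff_correlation_map(proxy, temp_field):
    """Correlation between a proxy series and every grid cell of a
    gridded seasonal-mean temperature field, computed on
    first-differenced data to emphasize interannual variability.
    proxy has shape (nt,); temp_field has shape (nt, ny, nx)."""
    dp = np.diff(np.asarray(proxy, dtype=float))
    df = np.diff(np.asarray(temp_field, dtype=float), axis=0)
    dp = dp - dp.mean()
    df = df - df.mean(axis=0)
    num = np.tensordot(dp, df, axes=(0, 0))
    den = np.sqrt(np.sum(dp ** 2)) * np.sqrt(np.sum(df ** 2, axis=0))
    return num / den
```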

Correlation between each tree-ring chronology and the GISS1200
instrumental temperature field, based on first-differenced data for seasonal
averages and time periods as used for calibration by each original
investigator (see Tables

Location of regions that the 15 tree-ring records represent,
plotted on the land/sea mask of the MPI-ESM model. Regions' short names are
explained in Table

The vastly different sizes of the regions, as well as their uneven
geographical distribution and different seasonal representation,
motivate some suitable weights

The equal and area weights are straightforward. The latter are
provided in Table

Hierarchical cluster tree based on nearest-neighbour linkage with
(

The tree-ring data need to be re-calibrated to appropriate regional
and seasonal mean temperatures. Thus, the GISS1200 seasonal mean
temperatures are averaged within each region and calibration is made
for the chosen calibration periods, following procedures explained in
Sect. 4 of SUN12 under the assumption that instrumental noise variance
accounts for 5 % of the total observed temperature variance in
each region (see Sect.
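The essence of this calibration, targeting the variance of the true rather than the observed temperature, can be sketched as follows (a simplified variance-matching version; the SUN12 estimator also involves the proxy-temperature correlation):

```python
import numpy as np

def calibrate_proxy(proxy, obs_temp, noise_frac=0.05):
    """Rescale a proxy series to regional mean temperature by variance
    matching, where the target variance is that of the *true*
    temperature, i.e. (1 - noise_frac) times the observed variance.
    With noise_frac = 0.05, instrumental noise is assumed to account
    for 5 % of the observed temperature variance."""
    p = np.asarray(proxy, dtype=float)
    t = np.asarray(obs_temp, dtype=float)
    target_sd = np.sqrt((1.0 - noise_frac) * np.var(t))
    return t.mean() + (p - p.mean()) / p.std() * target_sd
```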

Another decision concerns the time window for which

Estimated lag-1 autocorrelation for time units from 1 to 30 years in
the 3000-year-long unforced control simulation. Two-sided 5 %
significance levels for a white noise process are shown with dashed lines.
Data for each region, identified by the colour legend to the right, are for
the season as specified in Table

Finally, a time unit must be selected, i.e. the length of time periods
over which we average temperatures to obtain the pairs of simulation
(

Estimated autocorrelation function for lags up to 30 in the
3000-year-long unforced control simulation, for time units of 1, 3, 5, 8 and
12 years. Two-sided 5 % significance levels for a white noise process are
shown with dashed lines. Data for each region, identified by the colour
legend to the right in Fig.

The shortest possible time unit is dictated by the resolution of
tree-ring data, which is 1 year. Thus, letting the time unit
be 1 year would maximize the sample size. Therefore, we have
always used the 1-year unit for calibration of the tree-ring records
(in Sect.

To determine this, we analyse the autocorrelation in the 3000-year
control simulation in two ways for each region. First, the lag-1
autocorrelation is computed for all time units from 1 to 30. Then, for
a few selected time units (1, 3, 5, 8, 12 years), the
autocorrelation function is estimated for lags up to
30. Figure
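These two diagnostics can be sketched as follows (hypothetical helper names; the annual series is averaged into non-overlapping time units before the autocorrelation is estimated):

```python
import numpy as np

def block_average(series, unit):
    """Average an annual series into non-overlapping blocks of
    `unit` years, discarding any incomplete final block."""
    x = np.asarray(series, dtype=float)
    n = (len(x) // unit) * unit
    return x[:n].reshape(-1, unit).mean(axis=1)

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

def lag1_vs_time_unit(series, max_unit=30):
    """Lag-1 autocorrelation for all time units from 1 to max_unit."""
    return [lag1_autocorr(block_average(series, u)) for u in range(1, max_unit + 1)]
```

For an AR(1)-like control simulation, longer time units should bring the lag-1 autocorrelation closer to zero, which is what motivates checking a whole range of units.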

Time series illustration of data used for the

Figure

These results provide some general information. First, they illustrate how the combination of several forcings helps to produce significant correlations with the observed temperature variation, despite the often non-significant results for the individual forcings alone. Note, however, that greenhouse-gas and orbital forcings are also included in the E1 and E2 simulations, although no single-forcing simulations are available for these two forcings. Therefore, we cannot judge how much the latter contribute to the significant test values. Second, the variation among

We can conclude that both
multiple-forced ensembles explain a statistically highly significant
proportion of the temporal variation seen in tree-ring data. Thus, it
is a meaningful exercise to see whether they also fit the observations
better than unforced simulations. Figure

Figure

Clearly, neither the E1 (low) nor the E2 (high) solar ensemble is significantly closer to the observed temperature variations than the other. Moreover, results vary between regions. Concerning the effect of including or excluding the three tree-ring
records where RCS was not used, the regionally weighted results change very little. However, it may be noted that the non-RCS GOA record provides significant positive

Practical application of the SUN12 framework

In this study, we used an ensemble of climate model simulations run
with forcing conditions for the last millennium

When all forcings were combined (land use, small-amplitude

Another improvement to the SUN12 framework made it possible to test
directly whether one of the two multiple-forced simulation ensembles
(i.e. including either small- or large-amplitude solar forcing)
is closer to the observed temperature variations than the
other. However, results were highly variable at the
regional level, which made it impossible to judge whether any simulation
ensemble is more realistic than the other. Thus, this new analysis based
only on tree-ring data from several regions did not show any clearer results than a previous northern hemispheric-scale study based on
several compilations of different proxy data

Although
the weaker solar forcing is more in line with most recent viewpoints

One may ask to what extent the choice of using only tree-ring data
has influenced the results. For example, their inability to correctly
capture the long-term trend on millennial scales has been discussed by

Deficiencies in the climate model may also influence the results. The model used in this study is a low-top model

Information also from other types of proxy data should potentially
help to more conclusively compare a set of alternative simulations
with proxy-based climate observations. All our proxy records reflect
temperatures only in the tree-growth season, i.e. mostly a summer or
an extended summer season. Perhaps the regions used here are too
small, or too few, or not sufficiently well distributed in space. In
combination with a lack of information from winter, this might cause
internal unforced variability to dominate too much over the response
to external forcings. A model study by

There are certainly many more published proxy records (and more are expected to appear in the future) that could potentially be used in this type of model–data comparison study. It remains an open question, however, whether proxy data are best used as individual records, as is the case with most records in this study, or aggregated into larger-scale averages such as in the PAGES2k data set (PAGES2k Network, 2013). In that study, seven continental-scale annual-mean or summer-mean temperature reconstructions (including ASIA2k used here) were derived from different types of proxy data. This latter approach has the potential advantage of reducing the influence of various types of noise, both in the proxy data and from internal variability in models and in the real climate. A drawback, though, is that seasonally specific information in each proxy is partially lost and the optimal region and season for each large-scale data aggregate is essentially unknown. Thus, more theoretical and practical work addressing questions such as the optimal spatial analysis scale is motivated, in parallel with continued development of climate models, forcing data sets and climate proxy records.

This is a derivation of an adjustment factor for the correlation test
statistic

Suppose we have a climate model

With “climate model”, we
think of a realization of an atmosphere–ocean general circulation
model or an Earth system model integrated in time, with or without
time-varying external forcings. The variable

We consider the correlation test and the

The weighted empirical regression coefficient

First we note that, since

We start from the general formula for the variance of a linear expression,
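For concreteness, the exact variance of a weighted sum of a stationary series with autocovariance function gamma(k) is the double sum of a_i * a_j * gamma(|i - j|) over all index pairs. A numerical sketch (our own helper names), with the AR(1) model considered in this appendix as an example:

```python
import numpy as np

def var_linear_combination(a, gamma):
    """Exact variance of sum_i a_i * X_i for a stationary series with
    autocovariance gamma[k] = Cov(X_t, X_{t+k}), k = 0..n-1."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = np.asarray(gamma, dtype=float)[lags]
    return float(a @ cov @ a)

def ar1_autocovariance(n, variance, phi):
    """Autocovariance of an AR(1) process with stationary variance
    `variance` and lag-1 autocorrelation phi: gamma(k) = variance * phi**k."""
    return variance * phi ** np.arange(n)
```

With phi = 0 this collapses to the white-noise value, variance times the sum of squared weights; positive phi inflates the variance, which is why neglecting autocorrelation makes the tests too liberal.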

If we know the autocorrelations, this exact value can be used. Here we
continue by assuming that the

Thus we have an upper bound for the variance as a function of

The second inequality above, where the finite sum was replaced by an infinite
sum, should typically be very close to an equality. On the other hand,
the inequality motivated by Cauchy's formula is likely to be a large
exaggeration of the actual value. There are two parts in this
inequality. First, the sum of squares,

The second part concerns the magnitude of
sums of products relative to the sum of squares. Here we must bring in
the structure of the

Here we investigate the effects of autocorrelation on the

The first two terms of Eq. (

The last term is linear in

Turning to the first two terms, note that they are of identical type, so we
need only study one general such statistic,

There are two possible strategies when choosing the adjustment factor. Either
we simply use formula (

The results above were derived under an AR(1) model for the unforced
climate simulations. Figure

Whether MA(1) is a reasonable description must be judged from data.
Figure

Suppose we want to compare models that all include a particular underlying forcing that is not of current interest. One model has another, additional forcing that is of current interest, whereas the other model is a reference model, corresponding to no effect of the additional forcing. In that situation, where the reference is not a truly unforced null model, the correlation test statistic

As in Appendix

See
footnote in Appendix

In SUN12, the additional forcing of concern was represented by terms

For simplicity we assume here that there is no autocorrelation in the

Other terms representing unexplained variability (random fluctuations,
noise) are

The test statistic denoted

When a forcing is present in the reference model, we must expect that this
forcing causes an underlying positive correlation with the
observations on its own. For that reason we must bring in the
reference

The

The first two terms of Eq. (

The third term of Eq. (

The first two terms of Eq. (

We conclude that, also for the first two terms of Eq. (

Additionally, the result above can be used to compare two climate
models of the same kind, but driven with alternative versions of the
type of forcing of interest, to see whether one is significantly better
than the other. We then test the hypothesis that the two models are
equivalent, in the sense of having the same forcing and the same
magnitude of the response to this forcing. Expressed in terms of the
statistical model in Appendix

One additional comment is motivated here. If the two forcings of interest are truly different alternatives with somewhat different temporal evolution, then, clearly, neither of the models is a reference. But if the two forcings are just differently scaled versions of the same basic data, thus differing only in their amplitude, then the one with the smaller amplitude could be regarded as a reference, at least for the correlation test.
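A crude sketch of such a pairwise comparison, using regional distance values and treating regions as independent (the actual test derives its standard error from the statistical model in this appendix):

```python
import numpy as np

def paired_distance_test(dist_model_1, dist_model_2):
    """Compare two forced simulations via their regional distances to
    the observations.  Returns the mean regional difference d1 - d2 and
    a t-type statistic; a clearly negative statistic would favour
    model 1.  Regions are treated as independent, which is a
    simplification of the actual framework."""
    diff = np.asarray(dist_model_1, dtype=float) - np.asarray(dist_model_2, dtype=float)
    n = len(diff)
    mean = float(diff.mean())
    se = float(diff.std(ddof=1) / np.sqrt(n))
    return mean, mean / se
```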

In our experiment where we compared the E1 and E2 simulations, the
situation is somewhere in between, as the solar forcings differ both in
low-frequency amplitude and in temporal evolution. Because the
different amplitude is of the largest interest, we decided to estimate

We thank the four anonymous referees for careful reading and constructive comments. We also thank Ekaterina Fetisova for fruitful discussion of details in our method. This research was funded by the Swedish Research Council (grants 90751501, B0334901 and C0592401; the first two given to the lead author and the third to Gudrun Brattström at the Department of Mathematics, Stockholm University). Edited by: J. Luterbacher