Consistency of the multi-model CMIP5/PMIP3 past1000 ensemble

We present an assessment of the probabilistic and climatological consistency of the CMIP5/PMIP3 ensemble simulations for the last millennium relative to proxy-based reconstructions under the paradigm of a statistically indistinguishable ensemble. We evaluate whether simulations and reconstructions are compatible realizations of the unknown past climate evolution. A lack of consistency is diagnosed in surface air temperature data for the Pacific, European and North Atlantic regions. On the other hand, there are indications that temperature signals partially agree in the western tropical Pacific, the subtropical North Pacific and the South Atlantic. Deviations from consistency may change between sub-periods, and they may include pronounced opposite biases in different sub-periods. These distributional inconsistencies originate mainly from differences in multicentennial to millennial trends. Since the data uncertainties are only weakly constrained, the frequently too wide ensemble distributions prevent the formal rejection of consistency of the simulation ensemble. The presented multi-model ensemble consistency assessment gives results very similar to a previously discussed single-model ensemble, suggesting that structural and parametric uncertainties do not exceed forcing and internal variability uncertainties.


Introduction
The fifth phase of the Coupled Model Intercomparison Project (CMIP5, Taylor et al., 2012) incorporates, for the first time, paleoclimate simulations in its suite of numerical experiments. The last 1000 yr of the pre-industrial period are the most recent key period identified by the Paleoclimate Modelling Intercomparison Project Phase III (PMIP3, Braconnot et al., 2012). In contrast to the traditional time-slice simulations for specific periods of the past (e.g. the Last Glacial Maximum), the PMIP3 "past1000" experiments are transient simulations covering 850 to 1850 AD with time-varying estimates for external drivers, such as orbital, solar, volcanic and land-use climate forcings (Schmidt et al., 2011). The past1000 ensemble bridges a gap between the unperturbed control simulations and the historical simulations for the last 150 yr. It provides simulated estimates of a climate only slightly different from today. Since the ensemble allows for detailed comparisons with climate reconstructions, it assists in improving our understanding of past climate forcings and naturally forced climate variability and, in turn, in fingerprinting anthropogenic climate change (Hegerl et al., 2007; Sundberg, 2012; Schmidt et al., 2013). Assessing the quality of our simulations against paleoclimate estimates provides essential testbeds for our climate models (e.g. Schmidt et al., 2013).
Commonly, validation considers how accurately a simulated data set agrees with the observational data in terms of matching patterns (e.g. Taylor, 2001). Comparing simulations with reconstructions implicitly interprets both as representations of the same past. Based on this, their agreement may be taken as validation of the model and their disagreement may highlight model deficiencies. However, we have to take into account the considerable uncertainties in the reconstructions. Thus, we propose that it is appropriate in the past1000 context to assess the consistency of the simulations applying methods from weather-forecast verification following, e.g., Annan and Hargreaves (2010) and Marzban et al. (2011) prior to any subjective comparison. In particular, some models (including MPI-ESM) provide paleo-simulations at the same resolution as the historical and the future scenario simulations. Schmidt et al. (2013) emphasise the importance of such a setup for paleo-simulations to be useful in assessing the quality of simulations of the 20th century and of future climate projections. However, in contrast to the COSMOS-Mill ensemble, none of the past1000 simulations performed include calculations of a carbon cycle. We consider the multi-model analysis to clarify the consistency of simulations under the parametric differences between the models and the common or distinct structural uncertainties of the models (e.g. Sanderson and Knutti, 2012). Therefore, we do not a priori expect increased consistency compared to our earlier results (Bothe et al., 2013). Section 2 gives details on the methodological approach and the employed data. We present results on the consistency of the past1000 ensemble with the reconstructions and identify sources for the found (lack of) consistency in Sect. 3. In Sect. 4 we discuss our results. Short concluding remarks close the manuscript.
2 Methods and data

Methods
To build confidence in a simulation ensemble we may either consider the accuracy of its members in reproducing a given (observed) target (e.g. following Taylor, 2001) or assess its statistical consistency with a target data set (see Marzban et al., 2011). The evaluation of ensemble consistency follows the paradigm of a statistically indistinguishable ensemble (for a more detailed discussion of the methods see, e.g., Bothe et al., 2013). The underlying null hypothesis is that the verification target and the simulations are samples from a common distribution and therefore exchangeable (Annan and Hargreaves, 2010; Rougier et al., 2012). In the paleoclimate context, climate reconstructions are our best estimate of an observed target.
We analyse the ensemble consistency from two points of view. Firstly, probabilistic consistency considers the multivariate distribution of ensemble and verification data, and, secondly, climatological consistency considers the climatological distribution of the individual simulations (e.g. Johnson and Bowler, 2009; Marzban et al., 2011; Wilks, 2011).
The probabilistic evaluation addresses how the frequencies of occurrence of the ensemble data compare to those of the verification data. It allows us to assess the ensemble variance but also to detect biases. The climatological evaluation analyses the climatological variance and the biases within the individual ensemble members in relation to those of the verification data.
We can assess the probabilistic component of consistency for an ensemble by sorting and ranking the target against the ensemble data (e.g. Anderson, 1996; Jolliffe and Primo, 2008; Annan and Hargreaves, 2010; Marzban et al., 2011; Hargreaves et al., 2011). Counts of the calculated ranks are displayed as histograms following Anderson (1996). Under the null hypothesis of exchangeability (i.e. indistinguishability) of the distributions, a histogram should be flat since frequencies of observed and ensemble-estimated data agree for a consistent ensemble (Murphy, 1973). The criterion of flatness does not state that the ensemble indeed is consistent (as discussed by, e.g., Hamill, 2001; Marzban et al., 2011) but it is a necessary condition for our ensemble to be a reliable representation relative to the chosen verification. Marzban et al. (2011) emphasize the climatological consistency. They propose to evaluate it by plotting the difference between the simulated and the target quantiles against the target quantiles. For a consistent simulation, such residual quantile-quantile plots should display a flat outcome at zero. Residual quantile-quantile (r-q-q) plots ease the interpretation compared to conventional quantile-quantile plots. Marzban et al. (2011) and Bothe et al. (2013) provide more details on the advantages of the r-q-q plots.
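The ranking step can be sketched concisely. The following Python sketch (array shapes, the random tie-breaking rule and the synthetic data are our illustrative assumptions, not specifics of the paper) counts the rank of a verification value within the ensemble at each time step; for an exchangeable ensemble the resulting histogram is approximately flat:

```python
import numpy as np

def rank_histogram(target, ensemble, rng=None):
    """Rank-histogram counts following Anderson (1996).
    target: shape (t,); ensemble: shape (m, t).
    Returns counts over the m + 1 possible ranks."""
    rng = np.random.default_rng(rng)
    m, t = ensemble.shape
    counts = np.zeros(m + 1, dtype=int)
    for i in range(t):
        # rank = number of ensemble members below the target value;
        # ties are broken at random to keep the histogram unbiased
        below = np.sum(ensemble[:, i] < target[i])
        ties = np.sum(ensemble[:, i] == target[i])
        counts[below + rng.integers(0, ties + 1)] += 1
    return counts

# A target drawn from the same distribution as the ensemble
# should give a roughly flat nine-bin histogram.
rng = np.random.default_rng(0)
ens = rng.normal(size=(8, 5000))
tgt = rng.normal(size=5000)
counts = rank_histogram(tgt, ens, rng=1)
```

For an m-member ensemble there are m + 1 possible ranks, which is why a nine-bin histogram (and the 8 degrees of freedom in the goodness-of-fit test below) corresponds to an eight-member ensemble.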
Residual quantiles and rank counts provide easily understandable visualizations of deviations of the ensemble relative to the verification data. In r-q-q plots, biases of the ensemble data are seen as displacements from the y = 0 line. A positive slope in the residual quantiles highlights an overestimation of the difference of the quantiles to the mean (i.e. the variance) compared to the target quantiles, indicating an over-dispersive data set. On the other hand, a negative slope highlights an underestimation of the variance, a too narrow data set, to which we refer as under-dispersive. In rank histograms, dome shapes (U-shapes) indicate too wide (too narrow) probabilistic distributions, i.e. verification data are more often close to (distant from) the mean of the distribution compared to the simulation ensemble. Positive (negative) slopes represent negative (positive) ensemble biases, i.e. the target data over-populate high (low) ranks.
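The climatological r-q-q diagnostic admits a similarly short sketch (a minimal illustration; the interpolation onto common plotting positions, used so that series of unequal length can be compared, is our assumption):

```python
import numpy as np

def residual_quantiles(sim, target):
    """Residual q-q values (Marzban et al., 2011): sorted simulation
    quantiles minus sorted target quantiles, to be plotted against
    the target quantiles.  Residuals near zero indicate consistency;
    a positive slope marks over-dispersion, a negative slope
    under-dispersion."""
    qs = np.sort(np.asarray(sim, dtype=float))
    qt = np.sort(np.asarray(target, dtype=float))
    # interpolate the simulated quantiles onto the target's plotting
    # positions so that series of different length can be compared
    p = (np.arange(qt.size) + 0.5) / qt.size
    ps = (np.arange(qs.size) + 0.5) / qs.size
    qs_on_t = np.interp(p, ps, qs)
    return qt, qs_on_t - qt

rng = np.random.default_rng(2)
target = rng.normal(0.0, 1.0, 500)
wide = rng.normal(0.0, 2.0, 500)   # an over-dispersive "simulation"
qt, res = residual_quantiles(wide, target)
slope = np.polyfit(qt, res, 1)[0]  # clearly positive for the wide data
```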
We use the χ² goodness-of-fit test to test for the consistency of a rank count with the uniform, i.e. flat, null hypothesis. Jolliffe and Primo (2008) provide a decomposition of the test to further consider individual deviations from the expected flat outcome. These are, among others, bias and spread deviations. Goodness-of-fit statistics are presented for these two single-deviation tests and the full test, and discussed in terms of their p values with respect to the upper 90 % critical values of the χ² distribution (2.706 for the single-deviation tests with 1 degree of freedom; 13.362 for the full test with 8 degrees of freedom).
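A minimal sketch of the test and its decomposition, assuming the standard Jolliffe and Primo (2008) construction with a linear contrast for the bias component and a quadratic contrast for the spread component (the example counts are invented):

```python
import numpy as np

def chi2_decomposition(counts):
    """Chi-square goodness-of-fit statistic for a flat rank histogram
    with the Jolliffe and Primo (2008) decomposition into a bias
    (linear) and a spread (U- vs dome-shape) component.  Each
    single-deviation statistic is chi-square with 1 degree of
    freedom; the full statistic has k - 1 degrees of freedom for
    k bins."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    e = counts.sum() / k               # expected count per bin
    d = (counts - e) / np.sqrt(e)      # standardized deviations
    full = np.sum(d ** 2)
    i = np.arange(1, k + 1)
    lin = i - i.mean()                 # linear (bias) contrast
    lin = lin / np.linalg.norm(lin)
    quad = (i - i.mean()) ** 2
    quad = quad - quad.mean()          # quadratic (spread) contrast
    quad = quad / np.linalg.norm(quad)
    bias = np.dot(lin, d) ** 2
    spread = np.dot(quad, d) ** 2
    return full, bias, spread

# A strongly dome-shaped rank count (over-dispersive ensemble):
full, bias, spread = chi2_decomposition([4, 8, 12, 16, 20, 16, 12, 8, 4])
```

Here the full statistic (22.4) exceeds the critical value 13.362 and the spread component (about 20.8) exceeds 2.706, while the bias component vanishes for the symmetric counts: a dome-shaped histogram flags over-dispersion without bias.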
Analyses under the paradigm of statistical indistinguishability require special care if we use them in the context of paleoclimatology. Any simulated or reconstructed time series over the last 1000 yr includes components of forced and internal variability. If we assume that our estimates of past forcings are approximately correct, there should be a common forced signal in simulations and reconstructions. However, the reconstruction data uncertainties are possibly a lower bound for the disagreement between simulations and their validation target (Schmidt et al., 2013). Our analysis identifies whether the total variability, i.e. externally forced and internally generated together, originates from distributions that are similar enough to be indistinguishable. In this case, we would state that the reconstruction and the simulation ensemble are consistent.
If the variability of the ensemble data deviates significantly from that of the target, our approach identifies inconsistencies. The approach can also be used to highlight in which period the long-term signals do not agree between the reconstruction and the simulation ensemble. An arising lack of consistency in terms of the distributional characteristics indicates that the simulations and the reconstructions provide different representations of the past climate. Such deviations are informative as they suggest a need for model and reconstruction-method improvements. However, they also limit the validity of conclusions on past climate variability, the climate response to past forcings or the anthropogenic fingerprint.
The assessment of consistency reduces, in principle, the subjectivity associated with the comparison of simulations and reconstructions (compare Bothe et al., 2013). However, the large uncertainties require reconsideration of the importance of distributional deviations (compare Hargreaves et al., 2011; Annan et al., 2011). Over-dispersion does not necessarily question the overall reliability of the ensemble (see Hargreaves et al., 2011; Annan et al., 2011). On the other hand, if a simulation ensemble is found to be too narrow or biased considering the uncertainty, further simple comparison studies between the ensemble and the reconstruction may be misleading on the considered scales. We have to consider the suggested lack of consistency in subsequent research, but we may also conclude that, with the present uncertainties, comparison of simulated and reconstructed estimates is not informative. Note that consistency (i.e. exchangeability) and agreement (i.e. accuracy of the temporal patterns) may differ regionally; inconsistent regions can agree in the signal and regions lacking common signals can be consistent.

Data
If we want to achieve a robust evaluation of the consistency of a simulation ensemble, we should ideally consider more than one data set and more than one parameter, not least because of the prominent uncertainties in climate reconstructions. However, the global temperature field reconstruction by Mann et al. (2009) is the only data set that allows for a globally coherent evaluation of a climate parameter for the last millennium. It consists of decadally smoothed data. We further employ two area-averaged temperature reconstructions for the last 500 yr, for Central Europe (Dobrovolný et al., 2010) and the North American Southwest (Wahl and Smerdon, 2012). These data sets serve as examples of area-averaged reconstructions. The focus is motivated by the assumption that reconstructions and forcing data are generally more reliable during this period (compare Bothe et al., 2013).
We exclude the simulation with MIROC-ESM (by the Atmosphere and Ocean Research Institute at the University of Tokyo, the National Institute for Environmental Studies, and the Japan Agency for Marine-Earth Science and Technology) since it shows a problematic long-term drift (A. Abe-Ouchi, personal communication, 2012). On the other hand, a simple correction is performed for the drift of the GISS-R simulations (G. A. Schmidt, personal communication, 2012, see also Schmidt et al., 2012): we subtract a lowess fit (influence of about 600 yr) to the GISS-R pre-industrial control run (pi-Control) from the annually resolved data of interest, i.e. the grid-point data for the field evaluation and the relevant time series for the area-averaged assessment.
We generally use non-overlapping decadal means for simulations and reconstructions in the commonly included period 1000-1849 CE and anomalies relative to this period. For the data from Mann et al. (2009) we choose the central date for each decade (i.e. 1844 for the 1840s) since the data are originally decadally smoothed. Results change slightly but conclusions are the same when we employ non-overlapping decadal means for this reconstruction. We also employ the three sub-periods 1000s-1270s, 1280s-1550s and 1560s-1830s to evaluate how consistency may change over time. Inclusion of climates strongly biased from our reference period (e.g. the industrial period) would complicate our assessment of consistency focused on paleoclimates. The different grids of the simulations require interpolating the data onto a common T21 grid (∼5°).
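The preprocessing reduces to block-averaging and anomaly removal. A minimal sketch (the function name, the toy series and its start year are our illustrative choices; only the 1000-1849 CE window and the decadal blocking follow the text):

```python
import numpy as np

def decadal_anomalies(annual, start_year, first=1000, last=1849):
    """Non-overlapping decadal means of an annual series, restricted
    to the common period (1000-1849 CE, i.e. 85 decades), returned
    as anomalies relative to that period.  Assumes the kept span is
    a whole number of decades."""
    years = start_year + np.arange(len(annual))
    keep = (years >= first) & (years <= last)
    x = np.asarray(annual, dtype=float)[keep]
    dec = x.reshape(-1, 10).mean(axis=1)   # 85 non-overlapping decadal means
    return dec - dec.mean()                # anomalies w.r.t. 1000-1849 CE

rng = np.random.default_rng(3)
series = rng.normal(size=1000)             # toy annual series, 850-1849 CE
dec = decadal_anomalies(series, start_year=850)
```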
The global field reconstruction further allows evaluating the consistency of approximations of major climate indices which are commonly interpreted as low-dimensional descriptors of the climate system (compare, e.g., Tsonis et al., 2011; Dima and Lohmann, 2007). For these data as well, our approach is a step beyond the pure "by eye" approaches of reconstruction-simulation assessment. We construct two indices as field averages over the Pacific (150°E-140°W, 24-53°N) and Atlantic (74°W-0°E, 2-53°N) domains. Higher latitudes are excluded to avoid effects from sea-ice variability. Simulated indices are calculated from surface air temperatures, in contrast to the common definition via sea-surface or upper-ocean temperatures. This appears justified since the reconstruction is a hybrid representation of sea-surface and near-surface air temperature (compare Brohan et al., 2006; Mann et al., 2009) with only a minority of the underlying proxies being of marine origin. The indices are denoted by PDO (Pacific Decadal Oscillation) and AMO (Atlantic Multidecadal Oscillation), although our definitions differ from the convention (e.g. Zanchettin et al., 2013a, and references therein). We do not preprocess the input data and do not standardize the series. The indices accumulate globally and regionally forced as well as potential internal signals. Our later conclusions are robust against different definitions of the regional indices.
We have to include an uncertainty estimate in our analyses by inflating the simulation ensemble (compare Anderson, 1996). The uncertainty estimates are generally randomly selected from a Gaussian distribution with zero mean. For the regional reconstructions, the standard deviations are the reported standard errors, while, for the constructed indices, we use a standard deviation of 0.2 K, which approximates the estimates given for similar indices by Mann et al. (2009). For the field reconstruction, we follow Bothe et al. (2013) and take the largest standard error (σ ≈ 0.1729) reported for the Northern Hemisphere mean temperature series of Mann et al. (2009) as a reasonable uncertainty estimate for the field data.
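One plausible reading of this inflation step can be sketched as follows (the number of noise replicates per member is our assumption; the text specifies only zero-mean Gaussian noise with the quoted standard deviations):

```python
import numpy as np

def inflate_ensemble(ensemble, sigma, n_draws=100, seed=0):
    """Inflate a simulation ensemble by the data uncertainty
    (compare Anderson, 1996): replicate each member with independent
    zero-mean Gaussian noise of standard deviation sigma
    (e.g. 0.2 K for the index series, sigma ~ 0.1729 for the field
    data)."""
    rng = np.random.default_rng(seed)
    m, t = ensemble.shape
    noise = rng.normal(0.0, sigma, size=(n_draws, m, t))
    return (ensemble[None, :, :] + noise).reshape(n_draws * m, t)

ens = np.zeros((8, 85))                 # toy eight-member ensemble
big = inflate_ensemble(ens, sigma=0.2)  # 800 inflated members
```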
We observe that neglecting uncertainties in the reconstruction can lead to pronounced differences in our inferences about the probabilistic and climatological consistency of the ensemble. That is, the ensemble may appear under-dispersive or even consistent when the uncertainties are excluded, although it is found to be over-dispersive when they are considered.
Our knowledge of past forcings is rather weakly constrained, as seen in comparisons of the available reconstructions for land-use, total solar irradiance and volcanic eruptions as compiled by Schmidt et al. (2011, see also the discussion by Schmidt et al., 2013). For the employed simulations, differences are especially notable in the volcanic forcing, which implies that they mostly influence the annual time scale and pentad data. We will return to this point in our discussion of the origins of (lack of) consistency in Sect. 4.
Consistency in our setting depends on the reference time. Due to the way the reconstructions are produced, it is in principle advisable to centre all data on the calibration period of the reconstruction (J. Smerdon, personal communication, 2012) since this time is the reference for the calculation of uncertainties. We instead centre our data over the full studied period, and thereby shift the focus from the complete comparability over pre-industrial and industrial times to the comparability of the variability over the pre-industrial time only.

Global field consistency
Figure 1 gives a first impression of the probabilistic consistency of the past1000 ensemble with the global temperature field reconstruction by Mann et al. (2009). We display the p values of tests for a uniform outcome of the rank counts at every grid-point. The goodness-of-fit test leads to the rejection of the null hypothesis of a uniform outcome at grid-points for a p value larger than 0.9 (red in Fig. 1). Thus, the analysis shows a lack of consistency for large areas of the globe for the full period (Fig. 1a). In contrast, the European Arctic is the only spatially extended area for which rank counts deviate significantly from uniformity for an arbitrary shorter period (Fig. 1b). Possible consistency is diagnosed elsewhere. If we test for individual deviations of bias or spread, at least one of them is significant over much of the globe for both the full and the shorter period (Fig. 1c and d).
Obviously, the sample for each sub-period is small since we use non-overlapping decadal means, with the full period consisting of 85 data points.
For the same sample, rank counts confirm probabilistically the result of a generally over-dispersive ensemble by showing predominantly dome shapes (Fig. 2, right panels).
Thus, according to probabilistic and climatological considerations the ensemble appears to be often over-dispersive relative to the field reconstruction. The too-dispersive ensemble character agrees with the findings of Bothe et al. (2013) for the COSMOS-Mill ensemble (Jungclaus et al., 2010). Furthermore, since the distributional evaluation suggests changes over time in the relation between the reconstruction and the simulation ensemble, we can infer that reconstruction and simulations rarely represent the same climate trajectory. Neither the single-model ensemble (COSMOS-Mill) nor the past1000 multi-model ensemble reliably represents the climate evolution suggested by the reconstruction. However, this may be due to the uncertainties associated with the verification data.
We next consider the temperature indices for both domains (see Sect. 2.2 for details). Accounting for uncertainty in the reconstructions, the full-period residual distributions of the simulated Pacific (PDO) and Atlantic (AMO) time series arise as to some extent over-dispersive (Fig. 5a and c), mainly due to an overestimation of the tails, especially for negative anomalies.

Consistency
Residuals are nearly negligible for some simulations. The 90 % envelope for a block-bootstrap approach marginally includes the zero line of consistency for the PDO. Thus, the sampling uncertainties prevent rejecting consistency.
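A moving-block bootstrap of this kind can be sketched as follows (block length, number of resamples and the toy series are our assumptions; the text does not specify the bootstrap parameters):

```python
import numpy as np

def block_bootstrap(x, block=10, n_boot=1000, seed=0):
    """Moving-block bootstrap: resample a serially correlated series
    in contiguous blocks, preserving short-range autocorrelation.
    Returns an array (n_boot, len(x)) of pseudo-series from which
    e.g. a 90 % envelope of a statistic can be taken."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    n_blocks = int(np.ceil(n / block))
    # random block start positions, one row of blocks per resample
    starts = rng.integers(0, n - block + 1, size=(n_boot, n_blocks))
    idx = starts[:, :, None] + np.arange(block)[None, None, :]
    return x[idx].reshape(n_boot, -1)[:, :n]

x = np.sin(np.linspace(0.0, 20.0, 85))   # toy autocorrelated series
boot = block_bootstrap(x, block=10, n_boot=500)
# 90 % envelope of a statistic (here simply the series mean)
lo, hi = np.percentile(boot.mean(axis=1), [5, 95])
```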
For both indices, the reconstructed distributions for sub-periods shift from mainly positive temperature anomalies towards negative anomalies (Fig. 5b and d).
The associated residual quantiles resemble the temporal development of residual grid-point-data quantiles.A slight cold bias in the early sub-period with especially large deviations for the cold tail changes to a generally over-dispersive relation in the latest sub-periods.Distributional changes between the last two sub-periods are less prominent for PDO compared to AMO.
The full-period rank counts are significantly over-dispersive for both PDO and AMO (Fig. 5e and g) according to the goodness-of-fit test, indicating that the ensemble is not probabilistically consistent with the reconstruction for both indices.The bootstrapped intervals confirm the rejection of consistency although only marginally for the AMO.
For the sub-periods, the simulation ensembles are significantly biased for both indices in the early sub-period and significantly over-dispersive in the central sub-period (Fig. 5f and h). Bias and spread are significant for the PDO in the last sub-period, but only the bias is significant for the AMO then (Fig. 5f and h). Especially prominent are the over-dispersion for the late-period PDO estimates and the bias for the late-period AMO.
Thus, the regional indices confirm the field-assessment result of a simulation ensemble that tends to be over-dispersive relative to the global field reconstruction. Again, the simulation data do not reproduce the notable changes in the reconstructed distributions.

Consistency of regional reconstructions
We consider additional regional area-averaged temperature reconstructions to evaluate whether the mixed result relative to the field reconstruction is representative. The prominent uncertainty of climate reconstructions alone requires such additional evaluations.
We show only results for annual Central European and annual Southwestern North American temperatures starting from 1500 CE. Other regional reconstructions were assessed as well but are not discussed in depth. Accounting for uncertainties in the reconstructions, residual quantile distributions often indicate full-period over-dispersion for the decadal Central European temperature data (Fig. 6a). On the other hand, the data for Southwestern North America are mainly consistent (Fig. 6b). Nevertheless, deviations occur for some simulations but are not significant. These include an over- as well as an underestimation of the cold tail and an overestimation of the warm tail. The differences among simulations are more diverse for climatological residual quantiles relative to the Southwestern North American reconstruction compared to the results concerning the large-scale indices and the grid-point data. Climatological relations can differ remarkably for different regional reconstructions, as exemplified by the Central European and Southwestern North American data.
Rank histograms (Fig. 6c and d) indicate that the ensemble is probabilistically consistent with the Southwestern North American reconstruction, but the χ² goodness-of-fit test leads us (only just) to reject uniformity for the European data rank counts at the considered one-sided 90 % level (Fig. 6c and d). Similarly, the bootstrapped intervals in Fig. 6a-d only just result in rejecting consistency for the European data, but they in principle confirm the consistency for the American data. The bootstrapped envelope also highlights the high sampling variability. We note that over-dispersive deviations are much smaller for the Central European data than for the large-scale indices or the grid-point data.
Thus, the evaluation indicates better consistency of the ensemble relative to the two semi-millennial regional annual reconstructions than for either large-scale indices or grid-point data during the full period. On the other hand, analyses of additional millennial-scale reconstructions usually indicate more strongly climatologically and probabilistically over-dispersive relations with, again, notable variations in consistency over time (not shown). These regional area-average data sets often differ more strongly in their variability from the simulation ensemble than the Central European and annual Southwestern North American data.
We note that the ensemble shows negligible under-dispersion relative to the European reconstruction if we exclude the uncertainties (not shown), but it indicates slight over-dispersion under uncertainties (Fig. 6a). One could argue that, for an ideal ensemble, such rather weak opposite deviations indicate a consistent ensemble and only an overestimation of the target uncertainties.
To identify sources of disagreement we first consider mapped correlation coefficients between simulations and reconstructions (Fig. 3). Again we employ non-overlapping decadal means. Each simulated and reconstructed time series represents one realization of a climate response to the employed radiative forcing perturbations modulated by the internal variability. We also expect differences in parametrisations and methodologies to affect the outcome. We consider correlation analysis since it is a universal method in studies comparing simulations and reconstructions. Finding significant correlations between individual simulations and the reconstruction indicates that both data sets feature, to some extent, a similar signal, but it does not give information about the origin of the signal or whether the origin is common to both data sets.
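As an illustration of why a shared (e.g. forced) signal yields significant but modest correlations between two noisy realizations, consider the following sketch (all numbers are invented; the t statistic uses the naive df = n - 2, which is liberal if serial correlation remains in the decadal means):

```python
import numpy as np
from math import sqrt

def corr_tstat(a, b):
    """Pearson correlation of two series and the t statistic for a
    two-sided significance test with df = n - 2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size
    r = np.corrcoef(a, b)[0, 1]
    t = r * sqrt((n - 2) / (1.0 - r * r))
    return r, t

rng = np.random.default_rng(4)
common = rng.normal(size=85)                    # shared signal
sim = common + rng.normal(size=85)              # toy "simulation"
rec = common + rng.normal(size=85)              # toy "reconstruction"
r, t = corr_tstat(sim, rec)                     # r is moderate, not 1
```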
Indeed, mapped correlation coefficients suggest various degrees of agreement between individual simulations and the reconstruction (Fig. 3). Correlations are significant nearly everywhere for the ensemble mean (two-sided 99 % level) and CCSM4 (two-sided 90 % level) but less widespread for the other simulations. Most simulations correlate significantly negatively with the reconstruction at some grid-points over Antarctica. All simulations correlate significantly over the western tropical Pacific, the subtropical North Pacific and the South Atlantic. The simulations and the global reconstruction do not agree on the, possibly externally forced, phasing of variations in Antarctica and the eastern and central tropical Pacific. We note that Mann et al. (2009) report a pronounced cold anomaly in the tropical Pacific for the Medieval Warm Period (MWP). Prominent gaps in significance are also visible for the ensemble mean and for CCSM4 over the sub-polar North Atlantic, the tropical Pacific, the Indian Ocean and central Eurasia. Similarities in the correlation patterns may be interpreted as reflecting not only the intra-ensemble forcing variability/similarity but also the association between the models (compare Masson and Knutti, 2011).
Latitude-time plots of zonal means allow further comparison of the different data sets (Fig. 4). The reconstruction represents a near-global transition from positive anomalies in the first half (the MWP) to negative anomalies in the second half (Little Ice Age, LIA) of the considered 850 yr period (Fig. 4). The zonal means are possibly not representative in high southern latitudes due to data sparseness. The strongest warmth occurs at the beginning of the millennium. Episodic warmth interrupts the LIA during the 15th and 18th centuries and is generally confined south of 50°N.
The simulations capture neither the timing of the strongest warmth nor the near-global MWP-LIA transition. The ensemble generally displays near-stationary warm conditions. Short cold episodes related to assumed volcanic eruptions interrupt this warmth. Their timing, amplitude and spatiotemporal extent are similar in individual simulations. Weaker cold excursions reflect to some extent the variety of the employed forcings for reconstructed volcanic eruption properties (compare Schmidt et al., 2011). The ensemble mean differs most notably from the reconstruction in the lack of persistent northern hemispheric cold anomalies after about 1450 and in a stronger simulated cold signal in the 13th century. Otherwise it visually agrees well with the reconstruction.
We note that ensemble-mean correlation coefficients are often especially high (Fig. 3h) close to the proxy locations employed by Mann et al. (2009). This implies stronger commonalities at those locations where our proxy information about past climates is collected. That is, the similarity may allow inferring that simulations, reconstructions and the underlying proxies as well as the forcing series relate to a similar underlying climate signal. Such inference is in accordance with the results of Schurer et al. (2013), Fernández-Donado et al. (2013) and Hind et al. (2012). On the other hand, the hypothesised common signal is concealed by the internal variability of the simulated climates and the additional sources of noise associated with simulations and reconstructions. We find no identifiable relationship between the reconstruction and the simulations at the grid-point level.
The reconstructed and simulated indices partially share long-term tendencies (Fig. 5i and j). Amplitudes agree less than tendencies. The most prominent example of differences in long-term trends leading to biased estimates is the different timing of medieval warmth. Note further the strong disagreement due to, on average, colder reconstructed indices from the 14th to 17th centuries for the PDO and from the 16th to 18th centuries for the AMO.

Atlantic and Pacific
The indices display some intra-ensemble and ensemble-reconstruction agreement. Again we discuss correlations as an example of common practices. Ensemble-mean indices correlate at r ≈ 0.5 with the reconstructed ones. Correlations with the reconstructed index are larger than 0.5 for the PDO in FGOALS and for the AMO in CSIRO. Correlations among simulations larger than 0.5 are only found for the AMO, most prominently for MPI-ESM and CSIRO. PDO and AMO correlate most strongly in the reconstruction, in CSIRO and in the ensemble mean (r > 0.8). If the analysis is repeated for globally detrended data, no strong correlations are seen between the reconstructed and simulated indices.
We did not discuss regional average indices in Bothe et al. (2013), but in both regions the COSMOS-Mill simulations displayed more variability than the reconstruction and, for the North Atlantic, the ensemble consistency changed strongly between the considered sub-periods.
3.4 Sources of regional disagreement

Figure 6 clearly displays that there is no common signal in the regional average time series for Central Europe between individual simulations and reconstructions. This was similarly seen for the annually resolved Central European temperature indices of the COSMOS-Mill ensemble. Obviously, internal variability and methodological uncertainties dominate over the forced variability on the decadal and the inter-annual time scale for both ensembles. However, the COSMOS-Mill ensemble is consistent with the annual data of Dobrovolný et al. (2010) on the interannual time scale. Compared to the European data, the past1000-simulated North American Southwest temperature series agree slightly better with the respective reconstruction for the non-overlapping decadal means. Considering the full ensemble, no common forced signal can be found. Thus we do not comment further on the accuracy of both data sets.

Summary and conclusions
The CMIP5/PMIP3-past1000 ensemble is not generally consistent with the global temperature reconstructions by Mann et al. (2009) on a decadal time scale. This holds for the probabilistic and the climatological assessments. Inconsistencies between reconstructions and simulations prevent reconciling both paleoclimate estimates. Our assessment of consistency over the last millennium can be biased towards being too optimistic if existing discrepancies between different multicentennial sub-periods counter-balance each other.
The simulations and the reconstruction agree least in the tropical Pacific and the sub-polar gyre region of the North Atlantic according to our evaluation, while agreement is largest in the sub-tropical Pacific and the South Atlantic. The large-scale significant correlations for some individual simulations and the ensemble mean indicate that the reconstruction and the simulation ensemble possibly include a common signal.
The PMIP3-past1000 multi-model ensemble and the COSMOS-Mill single-model ensemble (Bothe et al., 2013) give very similar results with respect to their consistency, although differences exist for the diagnosed climatological deviations, which we attribute to the different handling of the volcanic forcing data. That is, multi-model and single-model ensembles similarly lack consistency with the reconstructions. Thus, the uncertainty due to structural differences and parametrisations in the models apparently does not exceed the uncertainties associated with different forcings and initial conditions.
The PMIP3-past1000 simulation ensemble and a selection of global and regional reconstruction validation targets are often not exchangeable climatologically and probabilistically. Therefore they should not be regarded as representing the same climate, i.e. they should not be compared under that implicit assumption.
These results imply the following:
1. The ensemble may be consistent with the verification data for either the full period or for sub-periods at the grid-point level and for area-averaged data, but only a few datasets show consistency on both time scales.
2. If consistency is diagnosed only for the full period, the ensemble and the reconstruction display a comparable amount of variability and a comparable climatological range over this period, but the long-term trends differ notably. We can also conclude that the variability differs between frequency bands, e.g. the reconstruction displays larger multi-centennial but smaller decadal variability, or vice versa. Furthermore, analyses depending on the background climate are hampered by the lack of sub-period consistency.
3. If, on the other hand, the data are consistent for a sub-period, analyses of the dynamics may be valid over this period, but not necessarily in other periods. Even then we have to be careful, since considering a different reference-period climatology may lead to different results, and the background climate may influence our assessment of the dynamics.
The deviations from consistency disclose the necessity for improvements of simulated estimates, of their reconstructed forcing data and of climate reconstructions. The large uncertainties render difficult any firm conclusions on past climate forcing and past climate variability (e.g. Hind et al., 2012). The probabilistic and climatological evaluations indicate consistency of the PMIP3-past1000 simulation ensemble with the reconstructions in some regions and over certain sub-periods, but strong biases and/or dispersion deviations arise in other regions and periods. This dominant feature of the analysis prevents reconciling the simulated and reconstructed time series either with respect to a common naturally forced climate signal or with respect to an estimate of internal variability.
The service charges for this open access publication have been covered by the Max Planck Society.
Table 1. Selected climate model simulations and their acronyms, the institutes of origin and the respective solar and volcanic forcing data sets. Full references are (from top to bottom): Vieira et al. (2011), Gao et al. (2008), Steinhilber et al. (2009), Crowley (2000), Jones and Mann (2004) and Crowley et al. (2008).

Fig. 2. Grid-point analysis of ensemble consistency for three sub-periods (1000s–1270s, 1280s–1550s, 1560s–1830s). Left three columns: residual quantile-quantile plots for a selection of grid-points for the first (light gray), second (dark gray) and third (colored) sub-periods. Right three columns: rank histogram counts for the selection of grid-points for the three sub-periods (first to last, light to dark gray) and the full period (black, scaled to match frequencies in sub-periods). Large (small) red squares mark grid-points where spread or bias deviations are significant over the full period (over, from left to right, the first to third sub-period). Blue squares mark deviations which are not significant. Residual quantile plots show on the x axis the quantiles of the target and on the y axis the difference between simulated and target quantiles.

Fig. 5. Consistency of the indices for the North Pacific (PDO) and North Atlantic (AMO) regions. (a–d) Residual quantile-quantile plots for (a, c) the full period and (b, d) three sub-periods (defined as for Fig. 2) of 28 records (early, light gray; middle, dark gray; late, colored). (e–h) Rank histogram counts for (e, g) the full period and (f, h) the three sub-periods (light gray to black). Numbers are the χ² statistics for the periods. In (f, h) numbers refer, from left to right, to the early to late sub-periods. Blue horizontal lines give the expected average count for a uniform histogram. (i, j) Time series of the indices constructed from non-overlapping decadal means. Color-code as in legend except for shading. Shading for residual quantiles and rank counts (a, c, e, g) gives the 90 % envelope of block-bootstrapping 2000 replicates of block-length 5.

Fig. 6. Full-period residual quantile-quantile plots (left panels), rank counts (middle panels) and time series plots (right panels) for the reconstructions (top panels) of Central European annual temperature and (bottom panels) of North American Southwest annual temperature. For details on the representation see the caption of Fig. 5.
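The rank histograms and χ² statistics referenced in the captions follow a standard verification recipe, which can be sketched generically (our illustration, not the authors' code, and without the observational-uncertainty and block-bootstrap refinements used in the paper): at each time step the target value is ranked within the ensemble members, and the resulting rank counts are tested for flatness.

```python
import numpy as np

def rank_histogram(ensemble, target):
    """Counts of the target's rank within the ensemble.
    ensemble: (n_members, n_time); target: (n_time,).
    Rank 0 = target below all members; rank n_members = above all."""
    n_members = ensemble.shape[0]
    ranks = (ensemble < target).sum(axis=0)
    return np.bincount(ranks, minlength=n_members + 1)

def chi2_flatness(counts):
    """Pearson chi-square statistic of the rank counts against a
    uniform (flat) histogram; large values indicate bias or
    over-/under-dispersion of the ensemble."""
    expected = counts.sum() / counts.size
    return float(((counts - expected) ** 2 / expected).sum())
```

A flat histogram is the signature of a statistically indistinguishable ensemble; ∪-shaped counts indicate an under-dispersive ensemble, ∩-shaped counts an over-dispersive one, and a sloped histogram a bias.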

Fig. 3. Mapped grid-point correlation coefficients between surface air temperature series from the considered simulations and from the reconstruction. See panel titles for individual simulations. Ensemble mean in (i). Gray (black) dots mark two-sided 90 % (99 %) confidence.
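The two-sided confidence levels marked in Fig. 3 can, in the simplest case, be obtained from a t test on the correlation coefficient. A generic sketch follows (our illustration, not necessarily the test used for the figure; in practice serial correlation requires an effective sample size smaller than the number of decadal means):

```python
import math
from scipy.stats import t as t_dist

def corr_p_value(r, n):
    """Two-sided p value for a Pearson correlation r computed from
    n effectively independent pairs, via the t transform with
    n - 2 degrees of freedom."""
    t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * t_dist.sf(abs(t_stat), df=n - 2)

# e.g. r = 0.5 from about 100 non-overlapping decadal means:
p = corr_p_value(0.5, 100)  # well below the 1 % level
```

With only on the order of a hundred decadal values per grid point, reducing n for serial correlation can move marginal correlations out of significance, which is why the effective sample size matters for maps like Fig. 3.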