Using paleo-climate comparisons to constrain future projections in CMIP5



Introduction
The Coupled Model Intercomparison Project (Phase 5) (CMIP5) is an ongoing coordinated project instigated by the Working Group on Coupled Modelling (WGCM) of the World Climate Research Programme (WCRP), consisting of contributions from over 25 climate modeling groups (and over 30 climate models) from around the world. Because it encompasses both past and future simulations, this dataset is a unique resource for research into the connections between model skill and model predictions, and has the potential to greatly improve assessments of future climate change.
There were many uncertainties in climate projections highlighted in the IPCC AR4 (Meehl et al., 2007). Many of these, such as the future of sub-tropical rainfall and of El Niño variability, remain unresolved.

For clarity in the rest of the text, we define the term "ensemble" to denote the full multi-model database of results across all scenarios (which here encompasses all paleoclimate, historical, idealized and future projection simulations). The future projections used here consist of the four RCP scenarios (rcp26, rcp45, rcp6, rcp85), future possibilities that roughly produce radiative forcing at the year 2100 relative to 2000 of 2.6, 4.5, 6.0, and 8.5 W m−2, respectively. In addition, idealised simulations have been included to provide clean comparisons across models (such as 1 % increasing CO2 simulations, the response to an abrupt increase to 4 × CO2, atmosphere-only simulations, etc.). For ease of reference, we will use CMIP5 to refer to the entire database, including the PMIP3 simulations. Specific model simulations are referred to by their name in the CMIP5 database (i.e. rcp85, past1000, piControl, etc.), while the scenarios or periods are referred to more generally using a standard abbreviation or name (e.g. the LGM, MH, RCP 4.5).

The scope of the paper is as follows: Sect. 2 discusses theoretical frameworks for dealing with the multi-model ensemble, issues arising from the use of paleo-proxy data and the use of data-synthesis products; Sects. 3 and 4 discuss specific examples of skill metrics that may either have predictive power for future simulations, by showing robust behaviour across paleo and future experiments, or discriminate between future projections; Sect. 5 presents some exploratory analysis of additional potentially useful metrics that either diverge over time or are too sensitive to important uncertainties.

Palaeoclimate reconstructions
Many of the problems in dealing with reconstructing climate from paleodata are specific to the type of record, the time period and resolution concerned - for instance, annually resolved tree rings have issues distinct from lower resolution ocean sediment or pollen records (e.g. Kohfeld and Harrison, 2000; Ramstein et al., 2007; Jones et al., 2009; Harrison and Bartlein, 2012). There are however a number of general issues that affect the use of such data for model evaluation, including the potential for multiple climate controls on a given record, the scale over which they are representative, the need to quantify (and take into account) reconstruction uncertainties, and the sparse and uneven site coverage.

Records used for palaeoclimate reconstructions are in general influenced by several different aspects of climate as well as, potentially, non-climatic factors. For instance, oxygen or hydrogen isotopes from ice cores, carbonates or organic matter are physically meaningful variables, but do not necessarily have a one-to-one stationary relationship with temperature or precipitation (e.g. Werner et al., 2000; Schmidt et al., 2007; Masson-Delmotte et al., 2011). Vegetation, in addition to being influenced by several aspects of seasonal climate, is directly influenced by the atmospheric CO2 concentration (Prentice and Harrison, 2009). There are several approaches that have been adopted to overcome this type of problem: the use of multi-proxy reconstruction techniques, forward modeling of the system within a climate model or using climate-model output (see an example related to coral carbonate isotopes in Sect. 5.1), and model inversion or data assimilation. Multi-proxy reconstructions rely on the idea that different types of record will be sensitive to different aspects of climate, and that pooling the information from each of these records therefore provides a more robust reconstruction of any specific climate variable.
In the sense that forward modeling (and by extension model inversion techniques) is based on physical and/or physiological knowledge of the given system, the use of these approaches may be a more robust way of dealing with the non-stationarity issue - however, as with climate models, the results are constrained by the quality of the models and the degree to which the system is well-understood (see for example the discussion of CO2 fertilisation in Denman et al., 2007).

The scale over which a record is representative can be a major issue in comparing paleodata and model output. All types of records are responding to local conditions, and for basic meteorological variables it is rare for a record to be representative for spatial scales of more than 50-100 km (though many records, such as tropical ice core δ18O, may have strong correlations to climate further afield; e.g. Schmidt et al., 2007). Comparisons at these scales often require some form of dynamical or statistical downscaling of model output, though there are many associated issues (Wilby and Wigley, 1997). Alternatively, up-scaling reconstructions (for instance, through the use of gridding) can often reveal large-scale patterns that models could be expected to resolve, although this requires a sufficiently dense network of sites. Recent developments include the use of cluster analysis to classify types of model behaviour and to determine cohesive regions for comparison with the large-scale patterns in the observations (e.g. Bonfils et al., 2004; Brewer et al., 2007; Harrison et al., 2013).

Paleoclimate reconstructions are usually accompanied by estimates of measurement or statistical uncertainty. However, in past practice these uncertainties were rarely propagated into large-scale synthetic products (except in terms of non-quantitative quality control measures, see e.g. COHMAP, 1988) and even more rarely taken into account when the reconstructions were used for model evaluation.
However, quantitative measures of uncertainty have been included in more recent palaeoclimate syntheses (e.g. MARGO, 2009; Bartlein et al., 2011), and the use of fuzzy-distance measures (Guiot et al., 1999) provides an explicit way to take account of data uncertainties in data-model comparisons.

Paleo-modelling issues
There are two particular issues that are more problematic in paleoclimate simulations than, for instance, simulations of the 20th Century: model drift and forcing uncertainty.

The issue of coupled climate model drift arises because of the long (∼ thousands of years) time required to bring the deep ocean into equilibrium in coupled ocean-atmosphere models. In some cases, insufficient spin-up time may have been allowed before specific experiments are started. While drift also affects transient historical simulations, the relatively large forcings in the 20th Century mean that residual drift is usually a small component of the transient response. For simulations of the last millennium though, the forcings are much smaller, and drift in the early centuries of the simulation will be a larger fraction of the modelled change (Osborn et al., 2006; Fernández-Donado et al., 2012). One proposal to deal with this is via a correction using the drift in the control simulation (i.e. calculating a smooth trend and removing it from the perturbed simulation prior to analysis). While this works well for temperature, it is not very good for variables that exhibit threshold behaviour such as sea ice extent or precipitation. In practice, this issue needs to be assessed for each proposed comparison.

Secondly, there are important uncertainties in the forcings used for the paleoclimate experiments. This is also true for aerosols in the 20th Century simulations, for instance, but such issues are more prevalent in paleo-simulations. For example, both the magnitude of solar or volcanic forcing over the last millennium, and the size and height of ice sheets at the LGM are sources of major uncertainty. In the last millennium experiments, multiple forcing choices were proposed (Schmidt et al., 2011, 2012), but few groups have attempted (as yet) to comprehensively explore all the options, and this is also true for uncertainties associated with other time periods.
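The control-run drift correction described above can be illustrated with a minimal sketch; the series, values and polynomial degree here are purely illustrative (a simple polynomial stands in for the smooth trend fitted to the control simulation):

```python
import numpy as np

def remove_drift(ctrl, perturbed, deg=2):
    """Estimate a smooth drift from the control run and remove it from a
    perturbed simulation branched from the same state.

    ctrl, perturbed : 1-D arrays of annual means on the same time axis.
    deg             : degree of the polynomial used to model the drift.
    """
    t = np.arange(len(ctrl))
    drift = np.polyval(np.polyfit(t, ctrl, deg), t)  # smooth trend in the control
    return perturbed - (drift - drift[0])            # preserve the initial baseline

# Synthetic illustration: a slow cooling drift plus a small forced signal.
t = np.arange(500)
drift = -0.002 * t                                   # 1 K of drift over 500 yr
ctrl = 14.0 + drift
forced = 14.0 + drift + 0.3 * np.sin(2 * np.pi * t / 200.0)
corrected = remove_drift(ctrl, forced)
```

As noted in the text, such a correction is only appropriate for smoothly varying fields like temperature, not for threshold variables such as sea ice extent.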
If an insufficient range of different forcings is tested, it is plausible that mismatches between observations and simulations will be wrongly attributed to model error rather than to errors in the forcings.

It should also be noted that multi-model ensembles are not a controlled sample from a well-defined distribution of plausible simulations. Models are necessarily incomplete and there are common biases that have more to do with the state of computational technology than physics (for instance, poor or non-existent resolution of ocean eddies). Multi-model ensemble means can be informative and will generally outperform individual models, but care must be taken to assess the suitability of each included model, and weighting of individual models needs to be well justified (Knutti et al., 2010).

Approaches to comparing reconstructions and simulations
There has been a gradual evolution in the approaches for comparing reconstructed changes and simulations, from essentially qualitative graphical comparisons of model output and reconstructions of the corresponding climatic variables (e.g. Braconnot et al., 2007) to more quantitative approaches that measure model-data mismatch via some "metric" or distance function (e.g. Sundberg et al., 2012; Izumi et al., 2013). Metrics based on correlations or RMS differences between fields of modern data and model output have been commonly used in model evaluation (e.g. Taylor, 2001; Gleckler et al., 2008). These methods provide opportunities for both inter- and intra-generational model comparisons (Reichler and Kim, 2008; Harrison et al., 2013).
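As a minimal sketch of such field metrics (the fields, grid and weighting choices here are synthetic and illustrative), an area-weighted RMS difference and centred pattern correlation on a regular latitude-longitude grid can be computed as:

```python
import numpy as np

def field_metrics(model, obs, lats):
    """Area-weighted RMS difference and centred pattern correlation between
    two fields on a regular lat-lon grid (latitude varies along axis 0)."""
    w = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(model)
    w = w / w.sum()                                   # normalised area weights
    dm = model - (w * model).sum()                    # remove weighted means
    do = obs - (w * obs).sum()
    rmse = np.sqrt((w * (model - obs) ** 2).sum())
    corr = (w * dm * do).sum() / np.sqrt((w * dm**2).sum() * (w * do**2).sum())
    return rmse, corr

# Synthetic "observed" field and a biased, rescaled "model" version of it.
lats = np.linspace(-87.5, 87.5, 36)
lons = np.linspace(0.0, 355.0, 72)
obs = np.cos(np.deg2rad(lats))[:, None] * np.cos(np.deg2rad(lons))[None, :]
model = 1.1 * obs + 0.1
rmse, corr = field_metrics(model, obs, lats)
```

Because the synthetic model field is an affine rescaling of the observations, the centred pattern correlation is essentially one while the RMS difference is non-zero, illustrating why the two metrics are usually reported together (as in a Taylor diagram).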
Focusing on the collective performance of the ensemble as a whole, Hargreaves et al. (2011) tested the ability of the PMIP2 ensemble to represent the Last Glacial Maximum in terms of its "reliability", defined as the adequacy of the ensemble, considered in probabilistic terms, in predicting the changes documented in the paleo-climate archives during that interval. The concept of "skill" as adopted in the numerical weather prediction community is also useful as a quantitative test of model performance: that is, does a model produce a more accurate prediction (match to the paleo-climate record) than that which would be achieved by a simple null hypothesis? (Hargreaves et al., 2012b). In addition, nothing precludes comparing the simulations and paleo-record in the frequency domain: recent work has looked at the fluctuations in forcings and data as a function of timescale, and in principle these fingerprints could also be useful.

There are two main ways in which data-model comparisons can be used as a guide to the future - either as a validation of a robust relationship across models and scenarios, or as a method to discriminate between different models. A prerequisite for the latter is that the metric chosen actually correlates to future outcomes within the ensemble. If this is not the case, then the metric is orthogonal to the spread in the projections and cannot be used to constrain it. Even when such a relationship is found, we need to consider whether it is physically meaningful to be confident that it has not arisen either through chance due to a small sample size or as an artifact of the model or the experimental design. While connections may in principle be highly complex, it is natural as a first step to consider whether a correlation exists between past and future behaviour in the same diagnostic.
The search for useful metrics (in this sense) using modern data has generally been disappointing (Knutti et al., 2010), although there have been a small number of cases where apparently meaningful relationships have been found (Boe et al., 2006; Hall and Qu, 2009; Fasullo and Trenberth, 2012). It is notable that the first two examples relate future climate changes to externally-forced changes in the modern climate (relating to a decadal trend and a seasonal range, respectively), rather than using metrics based on the climatological mean state alone. This lends support to our working hypothesis that past variations seen in paleoclimate simulations might also be informative about the future, as well as increasing understanding about the past.

Where a credible relationship between past and future is found, there is a range of methods that can be applied to use observations to constrain future predictions (Collins et al., 2012). One method, applied by Boe et al. (2006), is to take the observational estimate and use the relationship (often linear) embodied in the correlation of model output to project this value into the future. An attractive feature of this approach, beyond its simplicity, is that it readily allows extrapolation of the observed relationship in the case where the true value is suspected of lying outside the model range. An alternative approach, which has been widely applied to perturbed physics ensembles, is more explicitly Bayesian and considers the ensemble as a probabilistic sample. For the prior, equal weight is typically assigned to each ensemble member. Probabilistic weights are then calculated for each member of the ensemble, according to their performance in reproducing the observations. This weighted ensemble then represents the posterior estimate of future change.
This method uses the model spread as a prior constraint, which, depending on one's viewpoint and the specific case in question, may be considered either a strength or a weakness of this approach. These and other methods are discussed in more detail in Collins et al. (2012).
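A minimal sketch of the weighting approach is given below; the ensemble values, the observation and its uncertainty are entirely synthetic (not actual model output), and a simple Gaussian likelihood is assumed for the observational error:

```python
import numpy as np

# Hypothetical ensemble: each model's past-change metric and future projection.
past = np.array([-4.1, -5.0, -3.2, -6.1, -4.6, -3.8, -5.5])  # e.g. past anomaly (K)
future = np.array([2.9, 3.6, 2.3, 4.4, 3.2, 2.7, 4.0])       # e.g. future warming (K)

obs, obs_err = -4.5, 0.5   # illustrative observed past change, 1-sigma uncertainty

# Gaussian likelihood weights, normalised (equal prior weight per member).
w = np.exp(-0.5 * ((past - obs) / obs_err) ** 2)
w /= w.sum()

# Weighted ensemble = posterior estimate of the future change.
post_mean = (w * future).sum()
post_sd = np.sqrt((w * (future - post_mean) ** 2).sum())
```

Members whose past change is far from the observation receive negligible weight, so with a small ensemble the posterior can be dominated by a handful of models, a sensitivity noted again in Sect. 4.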

Robust metrics
In this section we highlight physically-based correlations between key metrics that show similar patterns in the paleo-climate runs and in future projections (or more idealised scenarios). With evaluation via the paleo-climate record, these metrics can be considered robust, and thus provide contingent predictions of one variable given a potential change in the other.
3.1 Patterns of regional climate change vs. global means

The main climate forcings for the LGM are the lower concentrations of atmospheric greenhouse gases and the presence of the Laurentide and Fenno-Scandinavian ice sheets in the northern extratropics. The ice sheets have a strong local albedo effect (e.g. Braconnot et al., 2012) but also affect the mid-latitude large-scale atmospheric circulation due to the associated change in topography (e.g. Laîné et al., 2009). However, away from this perturbation to the atmospheric radiative budget and to the atmospheric dynamics, we expect that the greenhouse gas forcing would be the main forcing for the LGM climate change. There could then be a relationship between LGM climate change and future climate change for a given model, which could be useful in testing the ability of climate models to reproduce regional climate change relative to the global change.

Figure 1 shows the results comparing the mean annual surface air temperature change over a region to the global mean change for the abrupt4xCO2, 1pctCO2 and lgm CMIP5 simulations across a suite of models. We have considered the tropics (land + oceans) and the tropical oceans, which have been used previously in perturbed physics ensemble studies (Schneider von Deimling et al., 2006; Hargreaves et al., 2007); East Antarctica, for which the temperature change is shown to scale with global temperature change for the LGM and the CMIP3 2xCO2 and 4xCO2 changes (Masson-Delmotte et al., 2006a,b); and the well documented mid-latitude region of the North Atlantic and Europe.

For the tropics, for land and ocean points as well as for ocean points only, Fig. 1 shows that the relationship between the regional and global temperature change exists for the 1pctCO2 and abrupt4xCO2 anomalies, and is consistent across these two experiments. Such a relationship also exists for the LGM, but the slope is clearly lower than for the increased CO2 experiments.
Furthermore, the models which simulate the smallest warming for increased CO2 are not those which simulate the smallest cooling for the LGM (and similarly for the models with the largest warmings and coolings). The regional vs. global temperature change relationship appears more consistent between LGM and increased GHG forcings for East Antarctica and, surprisingly, over the North Atlantic/Europe region. However, for the tropics, rankings of the models according to their cooling for the LGM and warming for 1pctCO2 and abrupt4xCO2 are not consistent. This shows either that the impacts from the lower GHG concentrations are not symmetric compared to those for increased GHG concentrations, or that the ice-sheet remote impact extends to the tropics (Laîné et al., 2009).

Land-ocean contrasts
Model results have consistently shown that for the LGM, the continents cooled more than the ocean (e.g. Braconnot et al., 2007, 2012; Laîné et al., 2009), while, in a symmetric manner, predictions for future climate show a stronger warming over land than over the oceans (e.g. Sutton et al., 2007; Drost et al., 2011). The ratio between cooling over land and cooling over the ocean for the LGM tropics was ∼ 1.3 in the PMIP1 computed sea surface temperature (SST) simulations (Pinot et al., 1999), a result close to the ratio of ∼ 1.5 found for the PMIP2 fully coupled LGM experiments (Braconnot et al., 2012) and conspicuously close to the 1.5 ratio found by Sutton et al. (2007) for future climate.

This relationship also holds in the most recent CMIP5 simulations (Fig. 2), not only for the tropics but also for the well-documented region of the North Atlantic and Europe, consistent with the LGM data. It is worthwhile to note that this pattern was previously used to highlight the inconsistency in an earlier compilation of tropical LGM sea surface temperatures (Rind and Peteet, 1985). We conclude that these relationships are indeed robust, although they appear imperfectly understood (Lambert et al., 2011).

Regional extremes
Extreme climate events such as heatwaves and cold spells can have long lasting impacts on society or ecosystems (IPCC SREX, 2012). The development of such events spans days to a few weeks, so that they are largely intra-seasonal by nature (Seneviratne et al., 2012). In such a context, the generally linear relationship between reconstructions and actual climate can be strongly distorted. Hence, since extreme events are by definition rare, large numbers of examples are required to get good statistics. Simulations of the past millennium offer a promising tool to investigate modeled extremes since they sample a large range of possible cases. The strongest limitation is the scarcity of data that record extreme variables (Jomelli et al., 2007). However, if we can demonstrate the robustness of the relationships between short and longer term statistics over long periods of time, and/or their dependence on external forcings, we can potentially predict the behavior of temperature extremes in the future.

The statistical analyses of (daily) hot temperature extremes of the 20th century have shown that temperature is generally a bounded variable, for which the upper bound can be computed from the statistical parameters of extremes (Parey et al., 2010a,b). Diagnostic studies focusing on the probability distribution of temperature and precipitation extremes are often based on the application of Extreme Value Theory (EVT), though simpler metrics have also been used (e.g. Hansen et al., 2012). EVT describes the behavior of the probability distribution near the tails, and allows one to compute return levels for return periods that are longer than the period of observation (Coles, 2001). It has been applied to meteorological observations (Parey et al., 2010a), reanalysis data (Nogaj et al., 2006) and model simulations (Kharin et al., 2005, 2007) in order to quantify trends of extremes. It has also been shown that the extremes of hot and cold temperatures are correlated with mean temperatures over the northern extra-tropics (Yiou et al., 2009). Until now, few models had provided daily output of temperature or precipitation on multi-century timescales (Jansen et al., 2007). However, with increasing storage capacity, daily resolution data is becoming more common and was requested for simulations in the CMIP5 archive (Yiou et al., 2012).
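A minimal EVT-style illustration of a return-level calculation is sketched below; it is restricted to the Gumbel limit with simple method-of-moments estimators, rather than the full GEV fits used in the studies cited above, and the annual-maxima series is synthetic:

```python
import numpy as np

def gumbel_return_level(annual_max, T):
    """T-year return level from a Gumbel fit to annual maxima,
    using method-of-moments estimators for location and scale."""
    gamma = 0.5772156649                                     # Euler-Mascheroni
    beta = np.std(annual_max, ddof=1) * np.sqrt(6.0) / np.pi  # scale
    mu = np.mean(annual_max) - gamma * beta                   # location
    # Level exceeded with probability 1/T in any given year.
    return mu - beta * np.log(-np.log(1.0 - 1.0 / T))

# Synthetic annual temperature maxima (deg C), drawn from Gumbel(35, 1.5)
# by inverting the Gumbel CDF applied to uniform variates.
rng = np.random.default_rng(0)
x = 35.0 - 1.5 * np.log(-np.log(rng.uniform(size=200)))
level_100 = gumbel_return_level(x, 100)   # approximate 100-yr return level
```

With 200 "years" of annual maxima, the 100-yr return level is still interpolated within the sample; the value of EVT is that the same fit extrapolates to return periods longer than the record.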
In the extra-tropics, seasonal summer heatwaves are generally preceded by droughts in the winter-spring seasons (Fischer et al., 2007; Vautard et al., 2007), with a mechanism that involves a positive feedback between sensible heat fluxes, evapotranspiration and temperature (Schär et al., 1999), and this has also been found in climate model simulations. Rather than focusing on the mean response, by considering high (or low) quantiles of the variable to be predicted, one can build regression coefficients conditional on high (or low) values of this variable (Koenker, 2005). We illustrate this diagnostic in Fig. 3, by computing the quantile regression for the 90th and 10th quantiles of the summer hot day frequency and winter-spring precipitation frequency in the IPSL-CM5A-MR historical simulation and the E-OBS gridded dataset (Haylock et al., 2008). The quantile regression slopes illustrate the asymmetry of the precipitation or temperature dependence for hot or cool summers in Western Europe (Hirschi et al., 2011; Mueller and Seneviratne, 2012; Quesada et al., 2012; Seneviratne and Koster, 2012).

The general picture is that a dry winter/spring tends to favor a hot summer. But while wet winter-spring conditions are generally followed by cool summers (small spread between low and high quantiles), dry winter-spring conditions can be followed by cool summers as well as heatwaves (large spread between low and high quantiles), because the genesis of heatwaves can be broken in just a few days, due to fast variations of the synoptic atmospheric circulation (Hirschi et al., 2011; Quesada et al., 2012). Documentary evidence shows that major heatwaves and droughts also occurred in Europe in past centuries (Barriendos and Rodrigo, 2006; Camuffo et al., 2010). Hence, using a metric to capture heatwave dynamics is a promising approach to investigate major heatwaves that struck Europe during the last millennium, and to explore the relationship between summer temperature and winter-spring precipitation preconditioning, under different climate forcings.
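A quantile-regression fit of this kind can be sketched as follows; this toy version uses synthetic data and a grid search over slopes in place of the usual linear-programming solution (Koenker, 2005), exploiting the fact that, for a fixed slope, the pinball-loss-optimal intercept is the q-th quantile of the residuals:

```python
import numpy as np

def pinball(res, q):
    """Pinball (check) loss of a residual array for quantile q."""
    return np.where(res >= 0, q * res, (q - 1.0) * res).sum()

def quantile_slope(x, y, q, slopes=np.linspace(-5.0, 5.0, 2001)):
    """Slope and intercept of the q-th conditional quantile of y given x,
    found by grid search over candidate slopes."""
    best = min(
        (pinball(y - a * x - np.quantile(y - a * x, q), q), a) for a in slopes
    )
    a = best[1]
    return a, np.quantile(y - a * x, q)

# Synthetic data with noise that grows with x, so the upper-quantile slope
# exceeds the lower-quantile slope (asymmetric spread, as in the text).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 500)
y = 2.0 * x + rng.normal(0.0, 1.0 + 0.3 * x)
slope90, _ = quantile_slope(x, y, 0.9)
slope10, _ = quantile_slope(x, y, 0.1)
```

The spread between `slope90` and `slope10` is the kind of quantity plotted in Fig. 3: a large gap between high and low conditional quantiles corresponds to the wide range of summer outcomes that can follow dry winter-spring conditions.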

Discriminating metrics
In this section we highlight metrics for which we have paleoclimate information that serve to discriminate between models that show different behaviours in future projections (or more idealised scenarios).

Projections of precipitation change in South America have a large spread in the CMIP3 archive (Meehl et al., 2007). In future projections, most models simulate a dipole of precipitation change in Northern South America, but the sign of this dipole depends on the model. If this feature is an intrinsic response in each model to a forcing, it might be possible to evaluate the dipole response in the paleo-climate simulations. We define the precipitation dipole from the annual-mean precipitation averaged over the two poles of this pattern.

Figure 4 shows a strong link between precipitation changes in the future and precipitation changes in the MH. Models in group 1 show a dipole in the MH which is similar to the dipole they simulate in the future, with a strong southward shift of the ITCZ. In contrast, models of group 3 show instead a broadening of the ITCZ in the MH. Therefore, paleo-proxies of precipitation along the South American coast could help determine which group of models is the most realistic in the MH and, by extension, in the future. Available MH reconstructions suggest coherent precipitation changes everywhere except over Northeastern Brazil, a pattern that is most consistent with that simulated by the group 1 models.

To gain confidence in such a paleo-constraint, we need to understand the physical processes that explain the common behavior between past and future. This preliminary analysis will not fully answer this question, but it does illustrate how to make use of the wealth of past, future and idealized CMIP5 simulations. Table 1 shows a selection of correlations between precipitation changes and other model features. First, in the future climate, shifts in the ITCZ seem to be associated with shifts in the SST dipole in the Atlantic: models that shift the ITCZ the most southwards are those with the strongest warming south of the Equator relative to the rest of the Atlantic. ITCZ shifts in response to SST dipoles are expected (e.g. Kang et al., 2008). However, this relationship does not seem to hold for the MH to PI change.
Second, the atmospheric component of the model also appears to play a key role. Some of the different model behaviors can be seen in the amipFuture simulations, where all models are forced by the same pattern of SST warming. In addition, much of the different model behavior can already be seen in the sstClim4xCO2 simulations, where a quadrupling of CO2 is imposed with SSTs held constant. This is consistent with the fast response to CO2 being an important component of the total precipitation response in global warming (e.g. Bala et al., 2009). Models that decrease precipitation over Northern South America in the projections and in the MH are those that decrease precipitation over this region under 4 × CO2. They also happen to be the models with the strongest land surface warming in response both to 4 × CO2 and to MH forcing. Therefore, the different groups of models show different precipitation responses to SST changes, orbital forcing and 4 × CO2, but the response shows similarity between all these different forcings and within each model group. This suggests that common mechanisms are involved in the precipitation response to all forcings, and that this is representative of each individual model. Finally, it is worth noting that models in group 3 often show the most significant "double ITCZ" problem in the Atlantic, an obvious, and persistent, common model bias.

LGM constraints on climate sensitivity
The LGM has been a prime target for assessments of climate sensitivity since it is a quasi-stable period with significant climate differences from today, with reasonably well-known boundary conditions and sufficient data to reconstruct large-scale climate shifts (e.g. Lorius et al., 1990; Edwards et al., 2007; Köhler et al., 2010; Schmittner et al., 2011; PALAEOSENS, 2012). We can apply the methods described in Sect. 2 to estimate the equilibrium climate sensitivity based on the CMIP5 LGM simulations.

We use an ensemble of opportunity consisting of 7 models which participated in the PMIP2 experiment, together with 4 CMIP5 models for which sufficient data are available (at time of writing). Estimates of the climate sensitivities of these models were obtained from a variety of sources and were derived using a range of methods. For the PMIP2/CMIP3 models, sensitivity was generally calculated using a slab ocean coupled to the atmospheric component (Meehl et al., 2007), whereas in CMIP5, the most readily available estimates use a regression based on a transient simulation (Andrews et al., 2012). These estimates are not perfectly commensurate, with some models reporting a 10 % difference between the two methods (Schmidt et al., 2013). Some of the PMIP2 models used for the LGM simulations may also differ from the equivalent CMIP3 versions for which the sensitivity estimates were made. Thus, the values used here may be somewhat inconsistent and imprecise, although we expect the uncertainty arising from these sources to be modest in comparison to the range of values represented across the ensemble. The boundary conditions for the LGM simulations are essentially unchanged between PMIP2 and CMIP5 (save for changes in the shape of the imposed ice sheets), allowing us to consider these experiments as broadly equivalent (though there are some systematic biases, Kageyama et al., 2012).
Given limitations in the boundary conditions (such as the exclusion of the dust forcing, discussed below), these results should be considered as a proof of concept rather than conclusive.

The LGM was associated with a large negative radiative forcing anomaly with respect to the pre-industrial, including substantially lower concentrations of greenhouse gases (e.g. Köhler et al., 2010). However, the ensemble does not show the expected negative correlation between climate sensitivities and globally averaged LGM temperature anomalies (over the full 100 yr of simulation output) (Fig. 5a, see also Crucifix, 2006). There is a strong negative correlation in the tropics, most strongly in the latitude band 10° S-30° N (Fig. 5b) (Hargreaves et al., 2012a). The correlation is weaker at higher latitudes, where the feedbacks in response to large cryospheric changes may be very different to those exhibited in a future warmer climate. There is also a strong positive correlation in the Southern Ocean (i.e., colder LGM anomalies are linked with lower sensitivity), possibly due to a large range of biases in the control climate (Fig. 5c). The correlation of piControl temperatures to sensitivity points to the Arctic and the Southern Ocean as regions where the base climatology impacts sensitivity, probably via cloud effects (see Trenberth and Fasullo, 2010, for a discussion).

The strong negative correlation (r = −0.8) between the LGM temperature anomalies in the latitude band 10° S-30° N and the climate sensitivities of the models (Fig. 6) is physically plausible, since this region is far from the cryospheric and sea ice changes of the LGM, and the forcing here is dominated by the reduction in greenhouse gas concentrations. If we assume that the correlation with tropical temperatures provides a valid constraint on the real climate system, we can use this correlation to project an observational estimate of the past change onto the future, as in Boe et al. (2006).
Recently, Annan and Hargreaves (2012) generated a new estimate of LGM temperature changes, based on a combination of several multiproxy data sets and the ensemble of PMIP2 models. The method does not depend on the magnitude of changes estimated by the models, but only on their spatial patterns. Using the resulting estimate of LGM temperature change in this latitude band of −2.2 ± 0.7 °C (at 90 % confidence), the predicted value for climate sensitivity arising from the correlation is 2.

For a more explicitly Bayesian approach, we initially assign equal probability to each model in the ensemble. This choice can be questioned, given both the range of model complexities, and also the possible inter- or intra-generational similarities between models of related origins (Masson and Knutti, 2010). However, quantifying these issues is far from straightforward, so we make our choice for reasons of practicality and in order to demonstrate the utility of the overall method. A standard kernel density estimation based on the ensemble leads to the prior distribution presented as the green curve in Fig. 7, which has a 90 % range of 1.7-4.9 °C and a mean of 3.4 °C. The observationally-weighted distribution is shifted to lower values, with the mean reducing to 2.8 °C. Its 90 % probability range has only moved slightly, however, to 1.6-4.7 °C. The reason for the upper limit here remaining high is that the highest sensitivity model in the ensemble has been assigned a fairly large weight since it matches the reconstructions well. The small size of the ensemble means that this approach is rather sensitive to the presence or absence of particular models in the ensemble.
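The regression-based projection can be sketched as follows; the ensemble values, the observational estimate and its uncertainty in this snippet are purely illustrative, not the actual PMIP2/CMIP5 or Annan and Hargreaves (2012) numbers:

```python
import numpy as np

# Hypothetical ensemble: tropical LGM cooling (K) vs. climate sensitivity (K).
lgm_trop = np.array([-1.6, -2.0, -2.4, -1.8, -2.9, -2.2, -2.6])
sensitivity = np.array([2.5, 3.0, 3.7, 2.8, 4.3, 3.2, 3.9])

# Least-squares fit of sensitivity on the past metric across models.
slope, intercept = np.polyfit(lgm_trop, sensitivity, 1)

# Project an (illustrative) observational estimate through the fitted line.
obs, obs_sd = -2.2, 0.4
central = slope * obs + intercept
spread = abs(slope) * obs_sd   # contribution of observational error alone
```

A fuller treatment would also propagate the regression's residual scatter into `spread`; as noted in the text, an advantage of this approach is that `obs` may lie outside the model range and still yield a projection by extrapolation.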
The two approaches differ considerably in their use of the model ensemble. In the latter case, the ensemble is directly used as a prior estimate, which therefore imposes quite a strong constraint on climate sensitivity even before the observational constraints are used. The former method may be considered as roughly equivalent to using a prior that is uniform in the observed variable (here tropical temperature), although this approach is rarely presented in explicitly Bayesian terms. Despite the different assumptions and approaches, these methods both generate rather similar estimates for the climate sensitivity, both assigning highest probability towards the lower end of the prior range. Given the small ensemble size and possible naïvety of the assumptions made here, these estimates may not be robust and need to be tested using a larger ensemble.

Arctic sea ice
The rate and pattern of Arctic sea ice change in the future is of key scientific interest, due both to the surprisingly rapid changes currently occurring and to the large spread in model estimates of, for instance, the onset of summertime "ice-free" conditions (Stroeve et al., 2012; Massonnet et al., 2012). Recent studies (Mahlstein and Knutti, 2012; Abe et al., 2011) have demonstrated that biases in sea ice volume have a strong impact on the simulated responses to radiative perturbations, and that there may be a possibility to discriminate among models based on interannual modes of variability. The mid-Holocene simulations (driven mainly by changes in orbital forcing) may provide an orthogonal test of Arctic sea ice sensitivity. MH insolation changes imply that NH summers were warmer than summers today (see Kutzbach, 1981, and many subsequent papers). Paleo-data from the circum-Arctic region indicate that this warmth was accompanied by reductions in sea ice extent, at least during some months of the year (Dyke and Savelle, 2001; de Vernal et al., 2005; McKay et al., 2008; Funder et al., 2011; Polyak et al., 2010; Moros et al., 2006). The CMIP5 MH simulations (Fig. 8) consistently show decreases in sea ice extent from August through to November. Changes in winter months are not coherent across the models, though these changes are not well characterised in the paleo-data either. There is a relationship (Fig. 9) between the size of the anomaly at the MH and in future projections, presumably reflecting the underlying sensitivity of the sea ice model and Arctic climate in general (see also O'ishi and Abe-Ouchi, 2011). This suggests it may be possible to use a well-constrained MH anomaly to estimate the likely loss in future projections. However, it may also be possible to use more specific or local diagnostics to compare to a wider proxy network for a similar constraint (Tremblay et al., 2013).
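The kind of relationship shown in Fig. 9 can be quantified with a simple cross-model correlation and regression. The sketch below uses invented per-model sea ice anomalies (not actual CMIP5 output) and a hypothetical proxy-based MH estimate:

```python
import numpy as np
from scipy.stats import pearsonr, linregress

# Hypothetical per-model September Arctic sea ice anomalies (10^6 km^2):
# mid-Holocene minus piControl, and rcp85 (late 21st C) minus piControl.
# Illustrative values only.
mh_anom  = np.array([-0.4, -0.7, -0.9, -1.2, -1.5, -0.6, -1.1])
rcp_anom = np.array([-3.1, -4.0, -4.6, -5.8, -6.5, -3.8, -5.2])

r, p = pearsonr(mh_anom, rcp_anom)
fit = linregress(mh_anom, rcp_anom)

# A (paleo-constrained) MH ice-loss estimate could then be mapped onto
# the projected future loss through the regression line.
mh_obs = -1.0                        # hypothetical proxy-based estimate
implied = fit.slope * mh_obs + fit.intercept
print(f"r={r:.2f} (p={p:.3f}); implied future anomaly {implied:.1f} x 10^6 km^2")
```

As with the sensitivity example, the strength and significance of the cross-model correlation determines whether the paleo anomaly provides a useful constraint at all.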

Exploratory metrics and limitations
In this section we provide examples of where the paleo-climate information is ambiguous, or where connections seen in paleo-climate changes do not translate into the future for some reason. This may be related to forcing ambiguities, climate-change related divergence, or, potentially, a misunderstanding of the dominant processes. While these examples are not directly informative about the future, they illustrate how the limitations of our outlined approaches can be explored in ways that illuminate key uncertainties.

20th-century changes in tropical Pacific climate
The response of the tropical Pacific Ocean to anthropogenic climate change is uncertain, partly because we do not fully understand how the region has responded to anthropogenic influences during the 20th century. Instrumentally based estimates of these changes are themselves equivocal. However, pseudocoral records calculated from CMIP3 historical simulations did not reproduce the magnitude of the secular trend, the change in mean state, or the change in ENSO-related variance observed in the coral network from 1890 to 1990. Similarly large discrepancies are present between CMIP5 simulations and the observations, with none of the individual CMIP5 pseudocoral networks producing trends as strong as in the observed 20th century coral records. While the observational coral network suggests a reduction in ENSO-related variance and an El Niño-like trend over the 20th century, CMIP3 and CMIP5 simulations vary greatly on both points. Indeed, the trend in CMIP3 and CMIP5 simulated pseudocorals is indistinguishable from the trends observed in individual centuries of an unforced control run (Fig. 10, upper panel). We also find that the trends in mean state and change in ENSO-related variance within the basin are highly variable among the CMIP5 models, and even between ensemble members of the same model.
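Testing whether a forced trend is "indistinguishable" from unforced variability, as done here against control-run centuries, can be sketched as an empirical significance test. The series below is synthetic red noise standing in for a pseudocoral index; all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, phi=0.7, rng=rng):
    """Synthetic AR(1) red noise standing in for a pseudocoral index."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

def linear_trend(x):
    """Least-squares slope per time step."""
    t = np.arange(len(x))
    return np.polyfit(t, x, 1)[0]

# Distribution of 100-yr trends in overlapping segments of a long
# "control run" (here 2000 yr of unforced noise).
control = ar1(2000)
segment_trends = np.array([linear_trend(control[i:i + 100])
                           for i in range(0, 1900, 50)])

obs_trend = 0.05          # hypothetical observed century-scale trend
# Two-sided empirical p-value: fraction of unforced segments with a
# trend at least as large in magnitude as the observed one.
p_emp = np.mean(np.abs(segment_trends) >= abs(obs_trend))
print(f"fraction of control segments exceeding |trend|: {p_emp:.2f}")
```

If the observed trend falls well inside the distribution of control-run trends (large fraction), the forced signal cannot be distinguished from internal variability, which is the situation described for the pseudocoral trends in Fig. 10.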

Spectra and fluctuation analyses
As mentioned above, there is no restriction on what kind of variables, means, variances or higher-order statistics can be used in these analyses. In this section we highlight two analyses in the frequency domain that demonstrate the important role of relatively uncertain forcings in assessing skill. In Fig. 11, we show the maximum-entropy method (MEM) spectra (using 30 poles) for the NH mean land surface temperature over 8 last-millennium simulations with the GISS-E2-R model that were run with different combinations of plausible solar, volcanic and land use forcings (Schmidt et al., 2011, 2012). The spectra are similar for models that have the same volcanic forcing, and significantly different when the volcanic forcing is derived from a different dataset or where there is no volcanic forcing at all. Specifically, interannual to multi-decadal variability is much larger when volcanoes are imposed, and the larger the volcanic forcing, the greater the variability, with the largest response in simulations using the Gao et al. (2008) reconstruction, compared to the Crowley et al. (2008) reconstruction. In contrast, the difference between two different solar forcings (e.g. Steinhilber et al., 2009) is not detectable in this metric. (Note that the implementation of the Gao et al., 2008, volcanic forcing in these simulations was mis-specified and gave roughly twice the expected radiative forcing. Although part of the increase in variance seen here was unanticipated, given the uncertainties in specifying the forcing, the exercise is useful in highlighting the role of the forcings in determining variance.) Another analysis in the spectral domain is one focused on power law scaling (Lovejoy and Schertzer, 1986). Several scaling studies of GCMs demonstrate that they generally simulate the statistics (including spectral scaling exponents) reasonably well up to ≈ 10 yr scales (e.g.
Fraedrich and Blender, 2003; Zhu et al., 2006; Rybski et al., 2008), but may underestimate longer-term variability associated with slow processes such as deep ocean or land-ice dynamics which are currently missing or perhaps poorly represented in the models. Following these scaling studies, we calculate the Root Mean Square (RMS) fluctuation as a function of time-scale, from months to centuries, for the NH land temperatures, using the same eight runs of the GISS-E2-R model used above for the period 1500-1900 CE. Since the simulations are strongly clustered according to the volcanic forcing used (Fig. 11), for simplicity we averaged over the three GRE and three CEA volcanic runs and the two no-volcanic runs.
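The RMS fluctuation metric is straightforward to compute. Below is a minimal Haar-type implementation (assuming simple non-overlapping windows; the exact windowing used for Fig. 12-style analyses may differ), demonstrated on synthetic series whose fluctuations fall with scale (white noise, like the "weather regime") or grow with scale (a random walk, a crude stand-in for the "climate regime"):

```python
import numpy as np

def rms_fluctuation(x, scales):
    """Haar-type fluctuation: RMS difference between the means of the
    second and first halves of non-overlapping windows of each scale."""
    out = []
    for s in scales:
        half = s // 2
        nwin = len(x) // s
        diffs = [x[i * s + half:(i + 1) * s].mean() - x[i * s:i * s + half].mean()
                 for i in range(nwin)]
        out.append(np.sqrt(np.mean(np.square(diffs))))
    return np.array(out)

rng = np.random.default_rng(3)
scales = [4, 16, 64, 256]

# White noise: fluctuations fall off with scale (roughly s^-1/2)...
white = rng.standard_normal(2**16)
# ...whereas a random walk has fluctuations that grow with scale
# (roughly s^+1/2), qualitatively like the "climate regime".
walk = np.cumsum(rng.standard_normal(2**16))

f_white = rms_fluctuation(white, scales)
f_walk = rms_fluctuation(walk, scales)
print("white:", f_white)
print("walk: ", f_walk)
```

The crossover between these two behaviours is exactly the change of slope sign at decadal scales discussed below for the reconstructions and simulations.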
For comparison, we show the mean of the same metric from three multiproxy reconstructions (Huang, 2004; Moberg et al., 2005; Ljungqvist, 2010). The multiproxy average is processed with and without the 20th century to indicate the importance of that period for the scaling behaviour; in all cases the variance at the multi-decadal to century scale is greatly enhanced by the recent anthropogenic trend. These curves show fluctuations decreasing with scale over the low frequency weather regime (months to decades) but increasing in the climate regime (decades to centuries). The comparison with the GISS-E2-R simulations is illuminating. First, we note that at the decadal scale, the sign of all the slopes changes. However, the simulations vary in the opposite direction from the data: first growing and then decreasing with scale. Only the volcano-free runs (bottom) qualitatively follow the reconstructions by first decreasing and then increasing with scale. When compared to the surface data and multiproxy reconstructions, we see that at ∼ 10 yr the simulations have variance that is too large, while at longer scales (> 100 yr) the variance is too small.
These results demonstrate clear mismatches in behaviour between the models' simulated variance at different scales and the inferred variability from multi-proxy reconstructions. However, there are strong sensitivities to the (uncertain) external forcing functions, precluding a straightforward attribution of the mismatch to potentially mis-specified forcings, missing mechanisms, insufficient "slow" variability or data problems in the reconstructions.
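The first of these spectral analyses, comparing band-limited variance with and without volcanic forcing (Fig. 11), can be sketched with synthetic series. For simplicity this sketch uses a standard Welch periodogram (`scipy.signal.welch`) rather than the 30-pole MEM estimator, and the "volcanic" pulses are entirely illustrative:

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(1)
n = 12 * 1000                     # 1000 yr of monthly data

# Two synthetic "NH temperature" series: one with episodic, decaying
# volcanic-like cooling pulses superimposed on weather noise, one without.
noise = rng.standard_normal(n)
volcanic = noise.copy()
for onset in rng.choice(n - 60, size=25, replace=False):
    volcanic[onset:onset + 36] -= 2.0 * np.exp(-np.arange(36) / 12.0)

# Power spectra (frequency in cycles per year, fs = 12 samples/yr).
f, p_no = welch(noise, fs=12.0, nperseg=2048)
f, p_vo = welch(volcanic, fs=12.0, nperseg=2048)

# Variance in the interannual-to-multidecadal band (periods of 2-50 yr).
band = (f > 1 / 50) & (f < 1 / 2)
ratio = p_vo[band].sum() / p_no[band].sum()
print("band variance ratio (volcanic/no-volcanic):", round(ratio, 2))
```

The volcanically forced series carries measurably more variance in the interannual-to-multidecadal band, which is the qualitative signature separating the forced and unforced GISS-E2-R spectra described above.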

Hydroclimate divergence
Distinct from temperature, hydroclimate variability can be quantified using a range of variables, including precipitation, soil moisture, lake levels, or other synthetic indices (e.g. Nigam and Ruiz-Barradas, 2006). Most models provide output for these diagnostics, but often these variables are not available directly from paleo-climate archives, creating a challenge when conducting model-data comparisons. However, calibrations of networks of precipitation-sensitive tree ring widths have been used to reconstruct the Palmer Drought Severity Index (PDSI) in North America and Asia over the Common Era (Cook et al., 2004, 2010). PDSI is calculated using temperature-derived estimates of the evapo-transpiration and precipitation, and nominally represents a normalized index of drought. Projections of future hydroclimate based on PDSI and on model soil moisture diverge, trending towards drying (GISS-E2-R) or even wetter conditions over the coming decades (MIROC-ESM). The reason for this divergence lies in the treatment of evapotranspiration (ET) in the model soil moisture versus in the PDSI (Thornthwaite) calculation. In the PDSI calculation, temperature is used as a proxy for the energy available, while in the GCMs the soil energy and moisture budgets are calculated directly using explicit physical models. In reality, ET becomes increasingly decoupled from temperature as the temperature increases, a factor reflected in the model soil moisture but not in the PDSI index. For time periods with strong transient forcing in temperature (e.g., the late twentieth century and into the future), our analysis suggests that the usefulness of PDSI for diagnosing drought and hydroclimate trends is limited. This suggests caution should be used when trying to convert projected variables to those defined from the paleoclimate record.
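The temperature-only dependence of the Thornthwaite formulation can be seen directly from its formula. A minimal monthly sketch follows (omitting the standard daylength/latitude correction factor; the example temperatures are hypothetical):

```python
def thornthwaite_pet(monthly_temp_c):
    """Monthly potential evapotranspiration (mm) via Thornthwaite (1948),
    without the daylength correction. Temperature is the ONLY climate
    input, which is why PDSI built on it can diverge from explicitly
    computed model ET as the climate warms."""
    t = [max(tc, 0.0) for tc in monthly_temp_c]
    heat_index = sum((ti / 5.0) ** 1.514 for ti in t if ti > 0)
    a = (6.75e-7 * heat_index**3 - 7.71e-5 * heat_index**2
         + 1.792e-2 * heat_index + 0.49239)
    return [16.0 * (10.0 * ti / heat_index) ** a if ti > 0 else 0.0
            for ti in t]

# Hypothetical mid-latitude monthly climatology (degC), and the same
# climatology under a uniform +3 degC warming.
baseline = [-5, -2, 3, 9, 15, 20, 23, 22, 17, 10, 4, -2]
warmer = [tc + 3 for tc in baseline]

pet0 = thornthwaite_pet(baseline)
pet3 = thornthwaite_pet(warmer)
print(f"annual PET: {sum(pet0):.0f} mm -> {sum(pet3):.0f} mm")
```

Because warming necessarily inflates Thornthwaite PET, a PDSI based on it drifts towards drought under any strong temperature trend, regardless of what the model's explicit soil moisture is doing; this is the divergence described above.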

Conclusions and recommendations
In this paper, we have focused on the opportunities provided by the inclusion of "out-of-sample" paleo-climate experiments within the CMIP5 framework, and specifically on how measures of skill in modelling paleo-climate change might inform future projections of climate change.
We have shown that some relationships are robust across the ensemble of models, simulations and paleo-data (Sect. 3) and furthermore that there are skill measures that are well correlated to the simulated magnitude of future change, thus allowing the likely magnitude of future changes to be constrained (Sect. 4). However, there is a need for caution because of the limitations with models, the experimental setup used in CMIP5, or with the paleo-climate data itself (Sect. 5).
Our examples suggest that there are some general requirements for attempts to use the paleo-climate simulations to quantitatively constrain future projections. Each application needs to define a suitable paleo-climate target of change, a metric of skill that quantifies the accuracy of the modeled changes, and an assessment of the connection to a future prediction. We recommend that ideally:

- paleo-data targets be spatially representative synthesis products with well-characterised uncertainties,
- the chosen metrics be robust to uncertainties in external forcing,

- the chosen metrics not be overly sensitive to the model representation of key phenomena, and be within the scope of the modelled system,
- any relationship between the targets in the past and the future predictions be examined, and not simply assumed.
Under these conditions, the likelihood of a significant constraint is much greater. We underline the need for paleo-simulations to be performed with models that are also being used for future projections, and for model diagnostics to be commensurate (see also Schmidt, 2012). Although the robustness of some of our analyses is limited by the small number of paleo-simulations currently available in the CMIP5 database, we hope that the demonstration of their potential to address questions relevant to the future will encourage other modeling groups to complete and archive these simulations.
There are also important lessons here for the paleo-data community. Our analyses rely heavily on the use of synthesis data products, for instance the MARGO dataset for the LGM (MARGO, 2009), pollen-based reconstructions for the Mid-Holocene (Bartlein et al., 2011), multi-proxy reconstructions of hemispheric temperature (e.g. Moberg et al., 2005), and gridded tree-ring based reconstructions of PDSI for the last millennium (Cook et al., 2010). Such products are invaluable, but there is a need for increased transparency of included uncertainties and continued expansion (e.g. see Müller et al., 2011, for sea ice extent). Increasing model complexity, for instance by including a carbon cycle, fire models or online tracers such as water isotopes, necessitates matching expansions of the data syntheses.

The periods and experiments chosen in paleo-climate experiments are far more limited than the number of interesting features in the paleo-climate record. The three periods selected for CMIP5 were chosen on the basis of their relative maturity (the existence of prior sets of experiments, already tested issues, existing data syntheses), but additional periods are also potentially useful: the mid-Pliocene (3 million yr ago), the 8.2 kyr event, the last interglacial, the peak Eocene etc. (see Schmidt, 2012, for justifications). Some of these periods are already being examined in a coordinated fashion (e.g. Haywood et al., 2012, and Dolan et al., 2012, for the Pliocene), and it is to be hoped that more will be started. Further expansion of the model experiments will increasingly produce higher frequency diagnostics (daily and sub-daily variations) and perturbed physics ensembles, to better characterise the model structural uncertainty. Both of these expansions will create possibilities for more, and better, tests of model performance. In the meantime, there is already huge scope for more informative comparisons that can be made using the existing databases.