Spatio-temporal variability of Arctic summer temperatures over the past 2 millennia

. In this article, the ﬁrst spatially resolved and millennium-length summer (June–August) temperature reconstruction over the Arctic and sub-Arctic domain (north of 60 ◦ N) is presented. It is based on a set of 44 annually dated temperature-sensitive proxy archives of various types from the revised PAGES2k database supplemented with six new recently updated proxy records. As a major advance, an extension of the Bayesian BARCAST climate ﬁeld (CF) reconstruction technique provides a means to treat climate archives with dating uncertainties. This results not only in a more pre-cise reconstruction but additionally enables joint probabilistic constraints to be imposed on the chronologies of the used archives. The new seasonal CF reconstruction for the due to lack of proxy data and thus the most recent warming is not captured.

followed by a gradual cooling into the Little Ice Age (LIA), with 1766-1865 CE as the longest centennial-scale cold period, culminating around 1811-1820 CE for most of the target region.
In total over 600 independent realisations of the temperature CF were generated. As showcased for local and regional trends and temperature anomalies, operating in a probabilistic framework directly results in comprehensive uncertainty estimates, even for complex analyses. For the presented multi-scale trend analysis, for example, the spread in different paths across the reconstruction ensemble prevents a robust analysis of features at timescales shorter than ca. 30 years. For the spatial reconstruction, the benefit of using the spatially resolved reconstruction ensemble is demonstrated by focusing on the regional expression of the recent warming and the MCA. While our analysis shows that the peak MCA summer temperatures were as high as in the late 20th and early 21st centuries, the spatial coherence of extreme years over the last decades of the reconstruction (1980s onwards) seems unprecedented at least back until 750 CE. However, statistical testing could not provide conclusive support of the contemporary warming to exceed the peak of the MCA in terms of the pan-Arctic mean summer temperatures: the reconstruction cannot be extended reliably past 2002 CE

Introduction
During the past decades, the Arctic has experienced a more rapid and pronounced temperature increase than most other parts of the world. The dramatically shrinking extent of Arctic sea ice in recent years -with a decline in both minimum extent in summer and maximum area in winter -accompanied by a transition to a younger and thinner sea ice cover, is often interpreted as the clearest and most unambiguous evidence of anthropogenic global warming (Comiso et al., 2008;Perovich et al., 2008;Serreze et al., 2007;Maslanik et al., 2011;Meier et al., 2014). Additionally, the Arctic region is of utmost importance in the context of global climate and global climate change. Reduction in perennial sea ice cover leads to increased heat transport northward (Müller et al., 2012;Smedsrud et al., 2008) as well as changes the Arctic energy balance due to positive albedo feedbacks (Curry et al., 1995;Miller et al., 2010;Perovich et al., 2002Perovich et al., , 2011. Melting of permafrost can release methane (CH 4 ), a more efficient greenhouse gas than carbon dioxide (CO 2 ), and likewise gives a positive feedback that may further amplify the temperature increase (O'Connor et al., 2010;Shakhova et al., 2010). Even partial melting of the Greenland inland ice cap and/or the numerous smaller high-latitude glaciers would significantly raise the global sea level and threaten to flood low-laying coastal regions around the world (Grinsted et al., 2010;Vermeer and Rahmstorf, 2009).
The instrumental temperature record is too short and spatially sparse to assess whether this recent warming and the accompanying sea ice reduction experienced in the Arctic region so far fall outside the range of natural variability on centennial to millennial timescales. Moreover, general circulation models have limited capabilities in reliably simulating Arctic climate change on the centennial timescale and beyond (IPCC, 2013). The simplified parametrisation of dynamic and thermodynamic sea ice processes, and the limited skills in describing ocean -sea ice -atmosphere energy exchange, in particular in modelling polar clouds and oceanic heat flux, is especially evident from the lack of skill in reproducing the present-day rapid loss of Arctic sea ice (e.g. Hunke et al., 2010). Hence both the possible unprecedented nature of the ongoing Arctic warming during the Common Era (CE, the last 2000 years) and the relative role of anthropogenic and natural forcings driving the process are difficult to fully assess without a longer perspective from palaeoclimate proxy-based temperature reconstructions. Thus palaeoclimate data that can be used for understanding the range of natural climate variability in the Arctic region over long timescales are needed, together with methods that integrate different types of information from a variety of palaeoclimate archives.
Since the 1990s, several multi-proxy reconstructions of Arctic and sub-Arctic (usually 90-60 • N) temperatures have been published. The first one of those was the multi-proxy reconstruction by Overpeck et al. (1997), who compiled 29 proxy records from lake sediment, tree ring, glacier, and marine sediment records to present a decadally resolved uncalibrated index of temperature variability since 1600 CE. They found that the highest temperatures in the Arctic region since 1600 CE occurred after 1920 CE. Kaufman et al. (2009) published the first quantitative multi-proxy reconstruction of summer temperature variability in the Arctic (90-60 • N) during the past 2000 years at decadal resolution using the composite-plus-scaling method. This study concluded that the 20th century warming reverses a long-term orbitally driven summer cooling and that the mid-and late 20th century temperatures were the highest in the past 2 millennia. Shi et al. (2012) published the first annually resolved multi-proxy summer (June-August) temperature reconstruction for the Arctic region, extending back to 600 CE, based on a set of 22 proxy records with annual resolution. They utilised a novel ensemble reconstruction method that combined the traditional composite-plus-scale method -known to underestimate low-frequency variability (e.g. von Storch et al., 2004) -and the LOC (local regression) method of Christiansen (2011) that exaggerates the high-frequency variability (see e.g. Christiansen and Ljungqvist, 2017). The reconstructed amplitude of the centennial-scale summer temperature variability was rather dampened and found to be less than 0.5 • C but with large year-to-year and decade-to-decade variability. Shi et al. (2012) found a clear cold anomaly from 630 to 770 CE, a peak warming from ca. 950 to 1050 CE, and overall relatively cold conditions around ca. 1200-1900 CE. However, three distinctly warmer periods during the Little Ice Age (LIA) were reconstructed around ca. 1470-1510, 1550-1570, and 1750-1770CE. Contrary to Kaufman et al. (2009, Shi et al. (2012) found peak medieval Arctic summer temperatures in the 10th century to be approximately equal to recent Arctic summer temperatures. Tingley and Huybers (2013) used BARCAST (Bayesian Algorithm for Reconstructing Climate Anomalies in Space and Time; Tingley and Huybers, 2010a), a method based on Bayesian inference of hierarchical models (see also Sect. 3), to reconstruct surface air temperatures of the last 600 years over land north of 60 • N. Their reconstruction is mostly based on the proxy data set collected by the PAGES 2k Consortium (2013). They found that while the recent decades were the warmest over the last 600 years, the actual interannual variability has remained effectively constant. Much of the data (most of the ice core and lake sediment records) used therein are common with the work presented here, with a few updated records (see Sect. 2.2 and PAGES 2k Consortium, 2017) and a few additional proxies. Hanhijärvi et al. (2013) presented a 2000-year-long annual mean temperature reconstruction for the North Atlantic sec-tor of the Arctic (north of 60 • N and between 50 • W and 30 • E) using 27 proxy records of various types, resolution, and length employing the novel pairwise comparison (PaiCo) method. Their reconstruction reveals centennial-scale temperature variations of an amplitude of over 1 • C, with a distinct Roman Warm Period, warm Medieval Climate Anomaly (MCA), and 20th century warming. A somewhat indistinct Dark Age Cold Period is found in the middle of the first millennium CE, whereas a very clear and persistently cold LIA extends from the mid-13th century until the turn of the 20th century, with the lowest temperatures in the 19th century. Peak temperatures during the Roman Warm Period and the MCA were found to equal recent temperatures in the North Atlantic sector of the Arctic. The PAGES 2k Consortium (2013) extended the PaiCo reconstruction to cover the whole Arctic (60-90 • N), using 67 proxy records of various types, resolution, and length to reconstruct annual mean temperature variations over the past 2 millennia. They reconstructed a generally relatively warm first millennium CE, followed by a relatively indistinct MCA, and a relatively cold LIA from ca. 1250 CE to 1900 CE. The amplitude of the reconstructed low-frequency temperature variability in the whole Arctic by the PAGES 2k Consortium (2013) is smaller than that reconstructed for only the North Atlantic sector of the Arctic by Hanhijärvi et al. (2013). A revised Arctic2k reconstruction was subsequently published by McKay and Kaufman (2014), using an updated and corrected proxy database containing 59 records, showing a larger long-term cooling trend and being on average ca. 0.5 • C warmer prior to ca. 1250 CE than reported by PAGES 2k Consortium (2013). Peak temperatures during the Roman Warm Period and the MCA thus approximately equal recent temperatures in McKay and Kaufman (2014) as in Shi et al. (2012) and Hanhijärvi et al. (2013), instead of being much lower as in the Arctic2k reconstruction by the PAGES 2k Consortium (2013).
This study is mostly comparable with that of Tingley and Huybers (2013): our method is an update of theirs (Tingley and Huybers, 2010a;Werner and Tingley, 2015), and the proxy network is an update of the PAGES2k database (PAGES 2k Consortium, 2017). There are, however, a few notable differences: (i) the climate field (CF) reconstruction is performed on an equal area grid (land only), which should be more suitable for a spatially homogeneous process -especially at high latitudes. (ii) This target gridded instrumental data set is directly derived from meteorological observation data without any interpolation over grid cell boundaries. (iii) The gridded reconstruction is reliable back to 750 CE, and in principle goes back into the first millennium CE. (iv) The proxy data set is larger and more extensively screened, and (v) the age uncertainties of the proxies used are respected. Thus, the propagation of uncertainties from proxy data to the final reconstruction product is more complete. (vi) Additionally, while Tingley and Huybers (2013) use a single set of parameters for all proxies of one type, these are estimated here for each individual record. This potentially removes spurious precision at proxy sites responding less strongly to the seasonal temperature anomalies and should increase the precision at locations with a stronger climate response.

Instrumental data and proxy data
The following section provides a short overview of the instrumental and palaeoclimate proxy data used. The quality of the input data and their distribution in space and time play a strong role in the reconstruction process and for the reconstruction reliability (see e.g. Wang et al., 2015).

Instrumental data
Several different gridded data sets for Earth surface air temperatures (SATs) are available from different research groups, derived from different subsets of instrumental data and presented on different types of grids. Most data sets, like CRUTEM4 (Jones et al., 2012) or CRU TS3 (Harris et al., 2014), for example, are presented on a regular equilateral grid, such as a 5 • × 5 • grid. Such a regular grid exhibits severe shortcomings when analysing data close to the poles, as the grid cells become very narrow in the meridional direction and almost triangular shaped. One data set, the Berkeley Earth Surface Temperature (BEST) (Rohde et al., 2013), is offered on a 1 • × 1 • grid as well as an equal area grid. While the latter would be a good fit for the process level model of BARCAST (see Sect. 3), an analysis revealed that this version of the data set shows long-distance correlations over our region of interest that might be deemed unphysical: the correlation length is on the order of the region size and the cross correlation between grid cells contains oscillatory parts with respect to the distance between the cells. The latter, especially, might be artefacts of the regridding and interpolation process.
Thus a new gridded instrumental data set is generated for this study. The instrumental data for the CRU TS3.24 (Harris et al., 2014) data set were downloaded from the CRU website (http://browse.ceda.ac.uk/browse/badc/cru/data/cru_ ts/cru_ts_3.24.01, last access: 7 March 2017). First, the data were converted into anomalies for the period 1801-2016 CE, using the method of Tingley (2012). The equal area target grid is taken from Leopardi (2006). To construct the gridded data, the instrumental data within each grid cell were averaged, using the variance adjustment scheme described by Frank et al. (2006). We aimed at retaining the variability that a single instrumental record in the grid cell would exhibit. This is a compromise between an actual grid-cellwide average and the limited spatio-temporal availability of instrumental data in high latitudes. In contrast to other regridding methods, no data were shared across grid cells by a prescribed spatial covariance structure or spatial interpolation algorithm. In contrast to Tingley and Huybers (2010a, b) and Werner et al. (2013)   and first year (coloured circles) of the regridded instrumental data. Symbols show the locations and type of proxy data used (PAGES 2k Consortium, 2017). The reconstruction target area is all grid cells marked with wire frames.
malised to have unit standard deviation prior to running the BARCAST sampler. As can be seen in Fig. 1, the resulting instrumental data set is very sparse in space and time. While ordinary reconstruction methods would indeed struggle with such input data, the advantage of the BARCAST method and the extension used here is that presence and absence of observations are explicitly modelled. The reconstruction target region is grid cells (wire frames in Fig. 1) that only contain land mass (continent and islands). This is necessary due to the constraints of the chosen reconstruction method (Tingley and Huybers, 2010a;Werner and Tingley, 2015), more specifically due the homogeneous process level model, which describes the temperature evolution on the grid cell level.

Proxy data
The proxy records (black symbols in Fig. 1) mostly come from the current version (2.0.0) of the PAGES 2k Consortium (2017) temperature database, with six recently updated ice core records from Greenland with revised and synchronised chronologies. The data set contains several types of natural archives (tree rings, ice cores, and marine or terrestrial sediments) and proxy measurements (such as ring width and stable isotopes). Thus the data are sensitive to different seasons, and on different timescales -partly due to different resolutions and the evaluation procedures but also owed to the processes generating the archives. All data north of 60 • N contained in the database were selected, with an a priori aim of including all annually resolved records.
As the PAGES 2k Consortium (2017) set out to generate a very inclusive data set, the need arose to again scrutinise the data. A few records were excluded (see Table D1), as they did not meet the required response characteristics on actual annual timescales. Additionally, data were divided into two classes: absolutely and precisely dated tree ring chronologies, and layer-counted proxies with age uncertainties. The latter comprises varved lacustrine sediments and ice core data. In contrast to the procedure outlined by Luterbacher et al. (2016), tree ring width measurements are not treated differently from maximum latewood density data, although the spectral properties should in principle warrant this separation (Zhang et al., 2015;Esper et al., 2015;Büntgen et al., 2015).
All of the proxy records used in this study are derived from annually banded archives. While tree ring records are compiled by cross-referencing a number of cores for each period, there is usually very limited replication of ice cores or varved lake sediments. Thus, these archives can (and usually do) contain age uncertainties (see Sigl et al., 2015) which need to be taken into account. Fortunately, the chosen method (Werner and Tingley, 2015) is able to deal with this issue, provided an ensemble of age models is given for each proxy. Appendix D details how these age models are generated. As the majority of the proxy data are more sensitive to summer or growing season temperatures, the target season for the reconstruction is the climatological summer season (the months from June to August, JJA) rather than the annual mean temperature. Werner and Tingley (2015) published an extension to the BARCAST method. It extends the work of Tingley and Huybers (2010a), providing a means to treat climate archives with dating uncertainties. The original method has been used in a collection of pseudo-proxy experiments (Tingley and Huybers, 2010b;Werner et al., 2013;Gómez-Navarro et al., 2015), as well as CF reconstructions over the Arctic (Tingley and Huybers, 2013), Europe (Luterbacher et al., 2016), and Asia (Zhang et al., 2018).

Reconstruction method: BARCAST+AMS
The method uses a hierarchy of stochastic models to describe the spatio-temporal evolution of the target CF (here: temperatures) C t ∈ R N at N different locations throughout time t, and the dependence of the observations O t ∈ R N (proxy data as well as instrumental data) on it: The process level is thus AR(1) (first-order autoregressive) in time, with an overall mean µ and the coefficient α modelling the temporal persistence. The year-to-year (or rather summer-to-summer) innovations have an exponentially (with distance between locations x i and x j ) decreasing spatial persistence that is homogeneous in space. The spatial e-folding distance is 1/φ. The climate is thus persistent in space and time, and information is shared across these dimensions. This is critical in constraining age models (see discussion in Werner and Tingley, 2015). The climate process C is never directly observed without error (latent process). The observations are modelled as a noisy linear response function: where the element-wise product (Hadamard product) is denoted by . The parameters (β 0 , β 1 , τ 2 ) are vectors and thus different for each location with observations, while in the past one set of parameters was assigned to each proxy type (e.g. tree ring widths, ice layer thickness, or isotopic values) (Tingley and Huybers, 2013). The instrumental observations are assumed to be unbiased and on the correct scale, so that, for this type of observation β 0 = 0 and β 1 = 1. The selection matrix H t is composed of zeros and ones and selects at time step t the locations for which there are proxy observations of a given type. That is, each proxy observation is assumed to be linear in the corresponding local, in time and space, value of the climate. While interannual temperatures roughly follow a normal distribution in our target region (Tingley and Huybers, 2013), a variable-like varve thickness is positive only. These variables are transformed using inverse quantile transformation (e.g. Emile-Geay and  to include them easily in the reconstruction. This data-level model is then refined to include dating uncertainties. To this end, Werner and Tingley (2015) consider the dependence of the local observations O s on the local climate: The vector e s is a time series of independent normal errors at location s (see e t from Eq. (1b)). In analogy to H t in Eq. (1b), T s is a selection matrix of zeros and ones that picks out the elements of the vector C s corresponding to elements of O s and is dependent on the age depth model (ADM) T .
From these model equations, conditional posteriors for the CF and all of the parameters (CF and instrumental or proxy observations) are calculated. Then, a Metropolis-coupled Markov chain Monte Carlo (MC) 3 sampler (Altekar et al., 2004;Earl and Deem, 2005;Li et al., 2009) is used to iteratively draw solutions from these posteriors; see Tingley and Huybers (2010a) and Werner and Tingley (2015) for details and implementations. In the version implemented here, BARCAST is slightly modified. While Tingley and Huybers (2013) used a single set of response parameters (β 0 , β 1 , τ 2 ) for all data of one type, and Luterbacher et al. (2016) actually set up a separate observation matrix with a set of parameters for each single proxy, the code is updated for this study. Here, the response parameters of Eq. (1b) are now vectors.
While this slows down the computations and also the convergence there is no good reason to assume that all proxies of one type respond in the same way across the whole domain and with no differences in proxy quality. This has been discussed already by Luterbacher et al. (2016), where two proxies that were initially in the PAGES2k database (PAGES 2k Consortium, 2013) proved to contain no clear temperature signal (see also changes in the updated database PAGES 2k Consortium, 2017) and were thus removed.
The reconstruction code is run in four chains for 8000 iterations without the age model selection code enabled. By then, the chains have clearly settled to a stable state, and the potential scale reduction factor (Gelman'sR) indicates convergence of the parameters (|R − 1| < 0.1). Then, the (MC) 3 code of Werner and Tingley (2015) is enabled, and the age models are varied. While this was not necessary in the work of Werner and Tingley (2015), the real world data are much sparser and noisier and do not follow the exact prescribed stochastic model Eq. (1a-1c). While this additional step helps speed up convergence it can cause the algorithm to strongly favour one set of age models. This can be checked by analysing the mixing properties over the age models in the heated chains (see discussion in Werner and Tingley, 2015). As noted therein, there is a tradeoff between the switching efficiency and the number of chains. With the infrastructure used, four chains using two cores each (for parallel linear algebra using the OpenBLAS library http://www.openblas.net) were deemed a reasonable compromise.

Reconstruction quality
The reconstruction calibration and validation statistics are shown in Appendix A. It has been shown that some of the commonly used measures, like the coefficient of efficiency and the reduction of error (Cook et al., 1994) are not proper scoring rules and should be avoided in such an ensemblebased probabilistic framework (Gneiting and Raftery, 2007). Thus, reconstruction skill is assessed using the CRPS pot (potential average continuously ranked probability score, which is akin to the mean absolute error of a deterministic forecast; see Gneiting and Raftery, 2007) as well as the reliability score (the validity of the uncertainty bands; Hersbach, 2000). Additionally, a probabilistic ensemble-based version of the coefficient of efficiency and the reduction of error are constructed from these (see Appendix A).
Both the CRPS pot as well as the reliability score show a decent reconstruction quality ( Fig. A1 top row). The CRPS pot on average shows a mismatch of 0.2 • C (0.4 • C) in the calibration (validation) interval and the Reliability is mostly better than 0.2 • C. This is on the order of the noise strength that the reconstruction code attributed to the instrumental observations. Additionally, the probabilistic ensemble-based version of the coefficient of efficiency and the reduction of error show a skilful reconstruction in most grid cells containing instrumental temperature data -at least in regions where proxy and instrumental data are present over most of the validation period. Note that the quality of the instrumental data, or rather the representativeness of (often) a single meteorological station record can be debated. In fact, in contrast to other BARCAST-based reconstructions the one presented here shows a substantial (τ 2 I ≈ 0.25 • C 2 ) noise level for the instrumental data. As other gridded instrumental data sets employ spatial interpolation processes, these are generally smoother in space than the gridded instrumental data set generated for this study. Thus these gridded products are by design closer to the spatial characteristics of the process model in Eq. (1a).
Another means of assessing the reconstruction quality is to check the variability in or spread of the different ensemble members in space and time (see Appendix B). The effect of the spatially and temporally sparse data can easily be seen in Figs. B1 and B2, clearly indicating the increased uncertainties back in time and in space in the absence of proxy data. This analysis hints that while there could still be skill left in the mean Arctic summer temperature reconstruction in the first centuries CE, the precision of the spatial reconstruction rapidly decreases in areas that become more data sparse. While the reconstruction over the regions with local proxy data present -such as Fennoscandia -remains reliable, a time-varying reconstruction domain (or rather, domain over which the reconstruction is analysed) would be beyond the scope of this paper. Thus the gridded reconstruction is only shown back to 750 CE. However, for analyses over data-rich regions such as the North Atlantic sector, the full reconstruction period (1-2002 CE) can in principle be used. For a more uniform reconstruction skill back in time, more high-quality proxy data would be needed. As a first estimate, the distance between neighbouring proxy locations should be less than the spatial correlation length of the system, which is estimated to be around 1500 km.
Additionally, the spectral properties of both the reconstruction and the proxy input data are analysed (see Appendix C). Not all proxies contain signal on centennial or longer timescales, and the reconstruction method explicitly describes year-to-year summer temperatures as an AR (1) process. Despite the reconstruction showing properties of an AR(1) process over most of the reconstruction domain (see also Nilsen et al., 2018) the area-averaged temperature reconstruction exhibits similar variability on centennial and longer timescales as other multi-proxy reconstructions over the Arctic (see Fig. 3).

Results
In the following sections the resulting reconstruction is presented. First, the Arctic average is analysed and compared against other studies from the same region. Two periods of interest are identified in the reconstruction before the instrumental period: the warm MCA around 920-1060 CE and the following LIA, which in this reconstruction culminated in the early 19th century. These then provide a context for the current warming of the Arctic. A detailed analysis of an earlier extended warm period in the fourth and fifth centuries CE is omitted due to a higher uncertainty of the derived reconstruction prior to 750 CE. Yet it is acknowledged that the scales of the detected warming could be comparable to the following episodes that occurred later during the MCA. Finally, the spatial variability in the reconstructed temperature field is explored, with a focus on the most extreme periods.

Mean Arctic results
The ensemble mean of the area-averaged summer temperature reconstruction is shown in the bottom half of Fig. 2 as the pointwise (year-to-year summers) ensemble mean (heavy blue line). The first millennium CE shows a mean reconstruction that exhibits an apparent change in variability. This is caused by the increased variability between the different ensemble members and thus by the reduction in proxy data coverage back in time. The effect of the spatial proxy data coverage on the reconstruction intra-ensemble variance is further discussed in Appendix B.
The new spatially averaged SAT reconstruction shows a pronounced variability on a broad range of timescales. The longer-term, centennial to millennial, evolution of the reconstructed SAT demonstrates a reasonably good agreement with a general pattern that was inferred in previous temperature reconstructions for the Arctic and its subregions (Fig. 3). Throughout most of the reconstruction period, the Arctic SAT anomaly shows an overall orbital-forcing-driven cooling trend.
Superimposed on the trend are three major centennial to multi-centennial scale anomalies: a warm period in the fourth and fifth centuries CE, the MCA -a warm period with a diverse spatial expression in the Northern Hemisphere during the 9th to 12th centuries CE, and the two phases of the cold LIA between ca. 1100-1450 CE and 1600-1900 CE. Figure 4 further demonstrates that the three aforementioned major climate anomalies together with the most recent period are associated with the likely warmest and coldest centuries in the Arctic over the last 2000 years. In particular, the 17th and 19th centuries CE with (within the uncertainty) similar ranged mean SAT anomalies of −0.9 ± 0.1 • C appear coldest in the ensemble average ( Fig. 4a) and are also ranked coldest in 52 and 25 % of the ensemble members, respectively. While the fifth century with a SAT anomaly of 0.1±0.2 • C appears warmest in 48 % of reconstruction members over the 2000 years, this inference should be considered with caution due to a higher reconstruction uncertainty for this data-sparse pre-750 CE period. For the later periods with better proxy coverage, the 10th century CE, accommodating the MCA with the ensemble average mean SAT anomaly of 0.0 ± 0.1 • C, along with the 20th century SAT anomaly of 0.0 ± 0.05 • C share the rank of the two warmest centuries over the last 1200 years in the Arctic, which is in line with other studies (e.g. Ljungqvist et al., 2012Ljungqvist et al., , 2016. The slow millennial-scale cooling is finally terminated by the contemporary warming, which is clearly identifiable since the middle of the 19th century. Figure 3 suggests that the LIA cooling is less pronounced in the new reconstruction compared with the same period reconstructed by McKay and Kaufman (2014), though the uncertainty intervals mostly overlap with their mean results. A likely explanation of this difference is the effect of targeting the summer season (as in our study) compared to annual mean in the reconstruction of McKay and Kaufman (2014). Throughout the LIA, sea ice cover has most likely experienced a pan-Arctic expansion as evidenced by proxy studies (e.g. Belt et al., 2010;Kinnard et al., 2011;Berben et al., 2014;Miettinen et al., 2015) and also supported by documentary evidence for the last phase of the LIA (Divine and Dick, 2006;Walsh et al., 2017). Such sea ice expansion would lead to an increased continentality of the climate in most of the study domain, implying larger summer to winter SAT contrasts (see e.g. Grinsted et al., 2006, for Svalbard). This has potential effects on differently targeted reconstructions and the inferred magnitude of LIA cooling. The new reconstruction, however, shows larger lowfrequency temperature variability than those reconstructed by Shi et al. (2012) or Tingley and Huybers (2013).
The transient features in the spatial mean reconstruction ensemble are analysed with the modified scale space method SiZer (Significance of Zero Crossings of the Derivative) (Chaudhuri and Marron, 1997). The original technique uses a local linear regression kernel-based estimator to produce a family of non-parametric smooth curves for the target data series for a range of kernel bandwidths (h). Assessment of Year CE T anomaly °C Kaufman et al. (2009) McKay and Kaufman (2014) Shi et al. (2012) Tingley and Huybers (2013) Hanhijärvi et al. (2013) This study the statistical significance of the scale-dependent features in the observed data, such as the local linear trend estimate, is then provided based on the inferred variability in the data and the quantile specified. The original SiZer summarises the data analysis results in a map which highlights the locations in scale (here the variability timescale) and space (here the point in time), where the slope of a smoothed version of the unknown true underlying curve is significantly positive or negative. The modification of SiZer used in this paper utilises the additional amount of information that is available via the ensemble of reconstructions. As the analysis is repeated for all individual members of the reconstruction ensemble, both the variability in the estimated slope of the smoothed curve and the spread in slope significance for a certain scale and point in time can be tested. This approach therefore enables the assessment of the robustness of features detected as significant to be made across the entire range of independent and equally likely reconstructions. Figure 5a presents results of the analysis, highlighting the timescales and periods for which at least 90 % of the ensemble members exhibit statistically significant changes. Results suggest that given the proxy network configuration and BARCAST settings used, the overall millennial-scale cooling trend as well as the MCA, LIA, and Contemporary Warm Period (CWP), appear as statistically significant features in the majority of the ensemble members. The MCA-to-LIA transition together with the onset of the CWP are the two coherent changes apparent on the broad range of timescales considered, down to a multi-decadal scale. The initial phase of the LIA-related cooling is flagged significant at a range of centennial to nearly millennial timescales. It is relatively early, centred already on 1030 CE. We note also that for the centennial timescale, Fig. 5a points to the onset of a statistically significant warming during the 1840s CE. This would justify using 1850 CE as the cut-off year for inferring the longer-term tendencies in the reconstruction prior to the CWP. Later, the period of 1917-1928 CE marks an ensemble-coherent warming trend in the terrestrial Arctic on the scale of about 30 years, which clearly links it to the early 20th century warming.
The statistically significant changes that are coherent across the reconstruction ensemble are four cooling and one warming episode revealed at the timescales of 30-100 years and centred at 1450 CE, 1591 CE, 1669 CE, 1810 CE (cooling), and 1477 CE (warming). In order to assess the magnitude and timings of the most rapid changes for the two se- The two largest statistically significant cooling rates in the entire ensemble with average temperature changes of −1.5 ± 0.4 • C and −1.1 ± 0.4 • C over 3 decades are registered at 1450 CE and 1669 CE, respectively, while a recovery after the first cooling centred at 1477 CE featured a warming of 1.2 ± 0.4 • C over a similar 30-year time period. In terms of the rate of changes attained, the first cooling-warming episode appears unique over the 2000-year-long reconstruction, including one of the coldest decades in the reconstruction ensemble. At the highlighted centennial timescale, the most rapid changes are the MCA-to-LIA transition with a cooling of −0.8 ± 0.3 • C centred at 1040 CE in line with glacier evidence on Svalbard (van der Bilt et al., 2015;Bakke et al., 2017), the cooling towards one of the LIA SAT minima at 1577 CE with −0.7 ± 0.2 • C, followed by the transi-tion to the CWP centred at 1905 CE with an average warming of 1.2 ± 0.2 • C over ca. 120 years, which is also the largest centennial-scale warming rate detected in the entire ensemble. Note that the intra-ensemble variations hinder a robust detection of statistically significant changes common for the majority of the spatial mean reconstruction ensemble members at the timescales shorter than 3 decades, with the cooling towards the absolute decadal minimum of the record at 1811-1820 CE being the only remarkable exception. The same applies to the pre-750 CE period that appears highly variable on a range of scales when the reconstructions are considered individually, but shows no single episode that is localised in time across all ensemble members. The latter is related to a much reduced density of the multi-proxy network for the considered period (see discussion in Werner et al., 2013), and due to the age model selection code, which would delocalise events in time (see Werner and Tingley, 2015, for details). Given the sparse proxy network before 750 CE, and a correlation length on the order of 1500 km, this clearly highlights the need for proxy data to be extended back in time.

Spatial trends over the Arctic
The analysis of the gridded reconstruction in the spatial domain is, in general, limited to proxy-rich periods and regions, especially on shorter timescales. Thus, the gridded reconstruction is mostly analysed back to 750 CE for the whole Arctic region. For a spatial subset that has a better proxy coverage back in time, such as the Atlantic sector, the whole reconstruction could be used (see Appendix B).
Since most other studies have analysed the temperature trends over the Arctic for the full period of 1-1850 CE, the spatial pattern of millennial-scale trends in the reconstructed Arctic SAT is depicted in Fig. 6, both back to 1 CE, and limited back to 750 CE. Note that there are large gaps in the proxy coverage for the early period, which result in nonsignificant and likely reduced trends over those parts of the target region. The results for the magnitude of the millennialscale cooling in the spatial mean reconstruction are in line with the previous studies, although the new reconstruction tends to agree best with McKay and Kaufman (2014) (see Fig. 6b). While for the ensemble and the reconstruction domain average the rate of cooling attains −0.05 ± 0.01 • C per century, which results in an overall temperature decrease of about −0.9 • C during 1-1850 CE, the analysis reveals that this long-term cooling trend seems spatially inhomogeneous. In particular, the largest magnitude of the millennial-scale cooling of up to −0.13 ± 0.02 • C per century yielding a temperature decrease of −2.4 • C over the period of 1-1850 CE is registered in the region between 0-30 • E and 10-170 • W, and only this domain actually contains proxies covering the full CE. Averaging over the longitudes similarly suggests that the largest cooling over the period preceding the contemporary warming has likely occurred in the region encompassing Greenland and the Canadian Arctic between 30 and 120 • W. At the same time, much less pronounced negative trends with an overall cooling of less than −0.4 • C over 1-1850 CE are detected in most of the Eurasian region within 30-180 • E. This is statistically significant only in a few locations. The results outside the proxy-data-rich regions are mostly reflective of the overall mean cooling trend of the remaining proxies; any in-depth analysis needs (by design of the reconstruction method) to be limited to locations closer than about one or at most two e-folding lengths (ca. 1500 km) to the proxy data.
Since the analysis of the gridded reconstruction is limited to the time after 750 CE, the results above need to be interpreted carefully, especially more than about 1500 km away from any proxy data. Thus, the bottom half of Fig. 6 presents a similar analysis for the time span of 750-1850 CE. The revealed pattern suggests a more even cooling throughout the reconstruction domain with circum-Arctic trend magnitudes similar within the uncertainty estimate.

Contemporary Arctic warming in the context of MCA and LIA climate anomalies
Comparing the magnitude and spatial extent of past warm periods featuring similar settings in external forcing with the present-day warming is of major importance since it provides possible limits for the scales of naturally forced climate fluctuations. Figures 2 and 3 suggest that in the new reconstruction the period of 900-1050 CE, typically associated with the peak of the MCA, shows up at least similarly warm as the reconstruction for the late 20th and early 21st centuries, although the instrumental data suggest much warmer temperatures in the last decade (2006. This is in accordance with the conclusions reached previously in Shi et al. (2012), Hanhijärvi et al. (2013), and McKay and Kaufman (2014). The Arctic mean SAT reconstruction before about 750 CE has much higher uncertainties, and robustly identifying warm periods becomes more difficult. Especially in contrast to the reconstruction by Hanhijärvi et al. (2013) the Roman times around the first and second centuries CE do not show up as particularly warm in the circum-Arctic mean, which is also reflected in the analyses presented in the previous section. Note that their reconstruction was limited to the North Atlantic sector of the Arctic, and thus a direct comparison is difficult. Additionally the spatial skill of the reconstruction decreases back in time as the proxy data become sparser (see Appendix B), and spatial averages thus result in higher uncertainties and the ensemble average will be closer to the overall mean. Taking these uncertainties into consideration, the focus will thus be on comparing the more tightly constrained MCA and LIA anomalies with the contemporary warm period. The warmest century-long period of the mean SAT reconstruction after 750 CE associated with the MCA occupies most of the 10th century CE (927-1026 CE). The peak decade-long warmth of the MCA occurred during 926-935 CE, when the reconstructed spatial mean SAT anomaly attains 0.48 ± 0.31 • C. The timing of the coldest centennialscale period of the LIA, specifically 1766-1865 CE, broadly associates it with a Dalton grand solar minimum. This period also contains the coldest decadal-long event in the reconstruction detected during 1811-1820 CE with the mean Arctic ensemble-based temperature anomaly of −1.5 ± 0.2 • C. This cold decade also coincides with the period of increased volcanic activity, with two major tropical eruptions of 1808/1809 CE and Tambora in 1815 CE. The second coldest decade of the LIA with the SAT anomaly of −1.4±0.2 • C likely occurred during 1463-1472 CE, also following strong volcanic forcing. Figure 2a presents the ensemble-based probability density functions (PDFs) of the mean SAT anomalies spatially averaged across the reconstruction domain for the six selected reconstruction sub-periods. These are the three selected century-long periods of 927-1026CE, 1766-1865CE, and 1903-2002 representing both the aforementioned warmest and the coldest century-long periods of the record after 750 CE as well as the last century-long period, the second warmest of the reconstruction, which includes the CWP (in this study since 1978 CE onwards). For comparison, the same PDF for the entire reconstruction period is also presented. To further highlight the contrasts between the mean and extreme climate states, PDFs for the three shorter decadal-scale intervals corresponding to the anomalously warm and cold periods of 926-935 CE, 1811-1820CE, and 1993-2002.4 for details) are displayed. The chosen decade of the CWP is the second warmest on average in the record after the MCA in the considered reconstruction period with a SAT anomaly of 0.41 ± 0.28 • C, followed in rank by a warm decade of the early 20th century warming (from 1930 to 1939 CE; not shown here). The maps of spatial mean SAT anomalies for these periods follow in Fig. 7.
Comparing the coldest phase of the LIA with a mean centennial-scale SAT anomaly of −0.94 ± 0.09 • C vs. the MCA 0.07 ± 0.13 • C and the last century of the reconstruction (SAT anomaly: 0.01±0.05 • C) emphasise the difference between the extreme warm and cold century-long periods in terms of the pan-Arctic summer temperature probability density. Figure  1990s likely introduces a cold bias when estimating presentday warming in the reconstruction.
In order to quantitatively test the significance of the observed reconstructed differences in SAT anomalies between the selected periods, the two-sample t test is used on the samples of the derived distributions. During the testing procedure the realisations from different ensemble members of the Arctic SAT annual means are not pooled. Rather, the respective PDFs for the selected periods are derived for every individual ensemble member of the reconstructed SAT. The procedure uses bootstrap estimates of the PDF for the period (MCA and CWP) averages derived from 100 independent draws. The two-sample t test with separate variances is applied to test the null hypothesis of the two samples associated with the two different warm periods to originate from two normal distributions with equal means and unknown and non-equal variances. Using a one-tailed t test should then provide information on whether the MCA was on average warmer or colder than the last 100 years. The test statistics for each ensemble member are then collected and analysed. The testing results for a two-tailed test with unequal variances rejected H 0 of equal Arctic mean SAT anomalies between 927-1026 CE and 1903-2002 CE for 93 % of ensemble members. However, when considering hypotheses with a one-tailed test, no conclusive answer can be reached. Although the MCA appears slightly warmer on a centennial timescale compared with the last 100 years as shown in Fig. 2a, testing rejects H 0 for 64 % of the ensemble members only, whereas for the opposite alternative hypothesis (i.e. 1903-2002 CE warmer than MCA on average) the H 0 rejection rate is as high as 29 %.
Though this result somewhat favours the alternative hypothesis of SAT MCA > SAT CWP , the difference in the rejection rates appears negligible. We therefore conclude that given the collection of the proxy and instrumental data, and the reconstruction technique used, it is not possible to infer whether the Arctic summers of the last 100 years of the reconstruction (i.e. before 2002 CE) were unprecedentedly warm when compared with the previous major warm climate anomaly back to 750 CE. We also note that higher variability in the derived ensemble of realisations for the mean Arctic SAT anomaly during the warmest decade-long intervals of the MCA and CWP similarly prevents reaching any firm inference on the relative magnitudes of the two decade-long anomalously warm periods of the new reconstruction.

Spatial signature of past and recent extreme temperature anomalies
The distribution of extremely warm and cold years in both space and time is analysed by ranking the years according to their seasonal temperature for each ensemble member and the reconstruction node. Due to insufficient proxy data density and hence the inflated intra-ensemble variance (see Fig. B1) in the early part of the reconstruction period, the analysis is limited to the time after 750 CE. For the Arctic average the probability density for each year to be ranked as warmest or coldest is calculated across the entire ensemble using the spatial mean SAT. To check the statistical significance of the derived probability densities, the analysis is replicated on an ensemble of surrogates derived from the original reconstruction ensemble using block bootstrapping of the spatial mean reconstructions along the time axis. The block size of 10 years was assigned using an ensemble-average first-order autocorrelation coefficient of 0.8 and Mitchell et al. (1966) formula with adjustment of Nychka et al. (2000), yielding the efficient number of degrees of freedom in the data of about 125. The derived time-average 0.975 percentile of the probability density for the bootstrap surrogates is then used as the respective quantile for marking the years as potentially coldest or warmest during the analysis period (Fig. 8). In order to highlight a decadal-scale variability in occurrence of warm and cold extremes, the fractions of potentially warmest or coldest years per decade are calculated in sliding 10-year-long windows. In order to reduce the effects of the reconstruction uncertainties, the reconstruction is averaged over 5 • longitudinal bins. To take the spatio-temporal autocorrelation into account during significance testing, the bootstrap replicates are drawn as 10year-long time slices from individual reconstruction ensem-  Fig. 9b. The results of the analyses are reflective of the longer-term (millennial and secular) pan-Arctic tendencies in the seasonal SAT, yet the inter-regional differences are made clear as well. Of the series of past and present exceptional warmings, compared with the part of the present-day warm period before 2002 CE, the peak of the MCA features the two phases of a pronounced pan-Arctic warming with a consecutive series of spatially coherent warm extremes between ca. 920 and 970 CE (Fig. 9b). On a decadal timescale (Fig. 9a) the MCA is marked over the whole region by anomalies having a persistently high fraction of likely warmest decades with no decades containing a year ranked as coldest.
In particular, 7 out of 10 years during the decade of 926-935 CE resulting in the first MCA sequence of warm extremes and 6 out of 10 years of 954-965 CE during the second maximum were ranked as statistically significant warm extremes among the ensemble members. Note that sequences of potentially warmest years and hence decades with a higher fraction of extremes are also detected before 800 CE, though the eighth century does not appear in the reconstruction as particularly warm on average. Figure 9b also highlights a difference in time evolution of the regional expression of the MCA via the spatial incoherence of extremely warm years or decades in this overall warm period. A somewhat earlier onset of warming in the European to Asian domain is evident from an increased frequency of warm extremes east of the zero meridian around 920 CE, followed by a coherent warming in the Greenland and North Atlantic sector of the study domain. Figure 9b also suggests that a second phase of the MCA could mainly be localised west of the prime meridian. Figure 7d exemplifies a picture of a pan-Arctic warming during the first warm decade (926-935 CE) of the first phase of the MCA with the largest reconstructed positive anomalies attained within the 170 • W-30 • E domain and 7 of 10 years ranked as potentially warmest in the reconstruction ensemble. A sequence of less pronounced MCA warm extremes occurred between 980 and 1040 CE localised primarily within the Atlantic sector (Greenland and Europe) of the study domain and do not exhibit as clear temporal coherence as the two main phases of the MCA. Figures 8 and B1 demonstrate that the period after the MCA termination features a variable climate as manifested by an alternating sequence of potentially warmest and coldest years detected on the regional scale. Yet there is a pronounced lack of the pan-Arctic warm extremes, with a short exception of a 15-year-long warmth centred at 1142 CE. The following transition into the cold LIA is clearly marked by a drop in the frequency of potentially warmest years or decades to zero. During the LIA the cold extremes dominate on both the regional (Fig. 9b) and pan-Arctic scales (Fig. B1) until the onset of the contemporary warming after ca. 1880 CE. The first potentially warmest year after the termination of the MCA is detected only in 1983 CE, also indi-cating that a summer manifestation of the early 20th century warming in the terrestrial Arctic (e.g. Yamanouchi, 2011) was not pronounced enough to compete with significance of other annual-to decadal-scale warm extremes of the last 1250 years.
One can discern five major clusters of cold extremes during the LIA with the years of 1192, 1464, 1599, 1697, and 1813 CE ranked coldest in the majority (52 %) of reconstruction ensemble members. Figure 9 suggests that 1464 CE is most likely to be the coldest year after 750 CE, while the coldest decade of the reconstruction is represented by a sequence of spatially coherent potentially coldest years within 65 • W-180 • E (Fig. 9b). Figure 7e shows the spatial pattern of cooling for this decade of the LIA with 7 out of 10 years over the period of 1811-1820 CE ranked potentially coldest across the entire reconstruction ensemble.
Contemporary warming is manifested as a sequence of potentially warmest years starting in 1980 CE within 45-100 • E and 60-110 • W and since 1995 propagating to almost the entire reconstruction domain. Figure 7f shows a spatial map of temperature anomalies for the period 1993-2002 CE that features 5 out of 10 statistically significant warm extremes on the pan-Arctic scale. When compared with the probability density marginalised over the spatial domain displayed in Fig. 8, contemporary warming clearly reveals a coherence both in the spatial domain and agreement over the range of ensemble members that is at least as strong as the estimates made for the MCA. Additionally, about 30 % of the potentially warmest years in the entire reconstruction ensemble were registered in the time interval of 1993-2002 CE. The year 2002 is ranked warmest in 14 % of the reconstruction members, which is almost 3 times as many as any other potentially warmest year detected in the analysis. One should emphasise that this statistically significant sequence of warm extremes was detected outside the calibration period, which provides another indirect proof for a skill of our new Arctic reconstruction. This reconstruction, however, does not extend into the very last 15 years, over which warming in the Arctic has been continuing. With these years included in the analysis, the signature of the CWP would much likely become more prominent (see discussion in Sect. 4.1). The warmest periods in the reconstruction shown in Fig. 7 share similar features in the higher latitudes. The circum-Arctic warm anomalies at the shorelines are linked in the current period to the receding sea ice margin. This is indicative of a possible minimum of sea ice extent during the MCA similar to the one observed now.

Discussion and conclusions
This paper presented a new circum Arctic CF reconstruction of summer season temperatures back to 1 CE. Due to the sparse proxy network and thus large uncertainties, the spatial reconstruction is evaluated only back to 750 CE with the Arc-tic average SAT anomaly evaluated back to 1 CE. The reconstruction uses a subset of 44 annually resolved temperaturesensitive terrestrial proxy archives of various types mainly from an updated and corrected PAGES2k database supplemented with six new recently published Greenland ice core series.
The technique applied is a recent extension of the Bayesian BARCAST, which explicitly treats climate archives with dating uncertainties (Werner and Tingley, 2015), which previously would be used on their "best guess" chronologies. The generated ensemble of 670 equally likely, independent realisations of past CF evolution in the Arctic together with a corresponding ensemble of synchronised chronologies represent the two major data outcomes of this study. As highlighted in Sect. 4.1, the probabilistic nature of the reconstruction results in straightforward uncertainty estimates even for complex analyses. As quantiles for a particular type of analysis are evaluated for individual ensemble members, the overall intra-ensemble coherence determines the spread and hence uncertainty of these quantities. The resulting ensemble of reconstructions including the ensemble of likely chronologies thus provides a convenient data set for further studies.
The quality of the reconstruction in the spatial and temporal domains was tested using a suite of metrics such as continuously ranked probability score (CRPS pot ) and the reliability score, which are more appropriate for the Bayesian framework than the coefficient of efficiency and reduction of error typically used in palaeoclimate research. Judging from these scores it could be demonstrated that the new reconstruction is skilful for the majority of the terrestrial nodes in the reconstruction domain, making it a useful product for studying the late Holocene Arctic temperature variability at regional scales. However, from the analysis of intraensemble variability, but also from analyses on the extreme years and the calculated confidence intervals the reduction of skill back in time is apparent. This is mostly caused by the proxy network, which is getting sparser when going back in time, and should be taken into account when the new reconstruction is used for making any quantitative inferences.
In addition to presenting the new reconstruction and assessing its quality, the derived ensemble is used to uncover the potential of the new product and consider the results in light of previous studies on the subject. The major findings from the analysis of the new reconstruction are as follows.
The area-averaged Arctic2k reconstruction features similar major cold and warm periods throughout the last 2 millennia and thus compares favourably with earlier studies targeting a similar season and region. In particular, there is a pronounced orbital-scale cooling trend over the CE -a period over which the summer insolation has mostly been decreasing, although the spatial pattern cannot be reliably reconstructed over the full CE due to the sparse proxy network before ca. 750 CE. Since the proxy data set from Greenland is dominated by oxygen isotope series from ice cores, these can be subject to a possible warm bias during the LIA bias caused by increased storm activity (Fischer et al., 1998) and/or be influenced by the site and source temperature-compensating effects (Hoffmann et al., 2001;Masson-Delmotte et al., 2005). The ice cores from northern Greenland are also expected to have a higher fraction of summer precipitation than those from the south due to the effect of continentality on the annual accumulation and hence exhibit a higher sensitivity to summer conditions. While site and source temperature compensating effects for the individual series can be accounted for by using the records of deuterium excess (Masson-Delmotte et al., 2005), other potential biases are difficult to resolve without additional support, e.g. from general circulation models.
The analysis of the reconstruction reveals the spatial signatures of the two major climate anomalies back to 750 CE, the MCA and LIA, as well as the beginning of the CWP in the circum-Arctic region. The MCA expression in the circum-Arctic region can be associated with a century-scale period between ca. 920 and 1060 CE showing an area-average SAT anomaly of 0.1 ± 0.1 • C during the warmest centennial period of 927-1026 CE. The MCA features two decadal-scale temperature maxima, both showing similar spatial extent of the regional SAT anomalies with the largest expression in the North American segment of the Arctic realm. A coherent warming of the period 927-936 CE during the first maximum of the MCA is associated with a potentially warmest decade of the reconstruction with the area-average summer temperature anomaly of 0.48 ± 0.31 • C. While the most recent warming shows an even stronger regional coherence than the MCA, even across continents (PAGES 2k Consortium, 2013;Ljungqvist et al., 2016), the MCA was still an unusual and extremely warm period in the context of the past 2 millennia. Note that despite the evidence for prominent and lasting temperature fluctuations in the pre-750 CE period as well, these results should be interpreted cautiously due to the drastic reduction in proxy data density in the early part of the reconstruction period.
The new reconstruction suggests a relatively long, though interrupted by abrupt decadal-scale warmings, transition to the LIA after the second of the two MCA maxima ends at around 1060 CE. The coldest century-long period of 1766-1865 CE shows an almost spatially coherent circum-Arctic summer cooling. The cooling over the LIA, from essentially around the late 11th century went on until the mid 19th century CE. Most of the Arctic was coldest during the decade of 1811-1820 CE following the 1809 (unknown) and 1815 (Tambora) eruptions, which caused the "Year without a Summer" in 1816 over most of Europe and yielded a circum-Arctic SAT anomaly of −0.8 ± 0.2 • C.
The last decade of the reconstruction of 1993-2002 CE, being outside the calibration interval, accommodates some 30 % of the potentially warmest years across the ensemble since 750 CE, with half of them associated with the year 2002 CE alone. Yet given the input data available and the re-construction method used, it still cannot be decided with sufficient confidence whether the warmest century and decade of the MCA or the CWP were warmest in the reconstruction. We speculate, however, that having the reconstruction extended into the very last 15 years, over which warming in the Arctic has been continuing, might have confirmed the summer SAT anomaly in the terrestrial Arctic to exceed the previous anomalous warm period of the Common Era.
The spectral characteristics of both proxies and reconstruction show that work is still needed on generating more and longer high-quality proxy series in parallel with a reanalysis of the existing data. Especially updating many of the only half-century-long North American tree ring series towards the present, but also possibly extending some of them into the first millennium CE seems to us like worthwhile effort (Babst et al., 2017). Additionally, a relative "flatness" of spectra on sub-decadal to multi-decadal timescales contrasting with an inflated variance of the multi-decadal to millennial variability (Appendix C) for some of the tree ring chronologies, suggests that a reassessment and potentially a revision of the raw data processing techniques used for these chronologies would be highly desirable.
BARCAST as a CF reconstruction technique still offers a large potential for future development and use in new, improved reconstructions. In addition to explicitly including the annually dated proxies with the chronological uncertainties in the scheme, which is a major innovation of the presented reconstruction, the next natural step will be a development of a theoretical and numerical framework to extend the technique to non-annually resolved proxy archives with chronological uncertainties. This will enable a substantial extension in the proxy coverage both in the spatial and time domains, including the marine realm dominated by non-annually resolved marine sediment proxy archives, potentially promoting an improved performance of the reconstructions at the low-frequency (centennial) timescales. While relatively flexible, the BARCAST framework would however still need major modifications that allow proxy response functions that are sensitive over different frequency bands. Additionally, these frequency bands need to be either proposed and fixed a priori, with possibly insufficient information available, or determined by the algorithm itself, potentially leading either to overfitting or convergence problems.
Code and data availability. The input proxy data are available through the individual publications (see tables in the Appendix). The majority are also available in the recent temperature database of PAGES 2k Consortium (2017), except for the NGT ice cores.
The treated input data and the R script files used for the treatment of the input data as well as the reconstruction results (ensemble reconstruction, gridded ensemble mean, and area mean), together with the program code, are made available through NOAA (https: //www.ncdc.noaa.gov/paleo/study/23031). The BARCAST code is © Werner and Tingley (2015).

Appendix A: Calibration and validation statistics
In order to estimate the skill of the reconstruction two different measures are used: the average potential continuously ranked probability score (CRPS pot ) and the reliability (Reli) score (Hersbach, 2000;Gneiting and Raftery, 2007;Werner and Tingley, 2015;Tipton et al., 2016). The reliability analyses the accuracy of the uncertainty estimates. In principle it compares the empirical coverage rates of uncertainty bands with their respective nominal coverage rate, e.g. a 95 % confidence band should contain the target truth 95 % of the time. The CRPS pot measures the accuracy of the reconstruction itself, i.e. the mismatch between the best estimate and the target. In a deterministic reconstruction it is equal to the mean absolute error. Both measures retain the original units of the data, and both signal a better result the lower they are. The results are shown in Fig. A1 (top row). For the calibration (validation) interval, the CRPS pot is mostly below 0.2 • C (0.5 • C), and the reliability is sharper than 0.1 • C. This in principle indicates a relatively low reconstruction error, with uncertainty bands that (within reason) reflect the correct uncertainties.
Additionally the skill of the reconstruction beyond forecasting the calibration or validation period climatology is evaluated. In palaeoclimate reconstructions this is often assessed by the coefficient of efficiency and the reduction of error statistics (Cook et al., 1994). These analyse whether the reconstruction is closer to the validation target than the climatological mean of the calibration or validation period, respectively. However, these are not proper scoring rules (Gneiting and Raftery, 2007) and should thus not be used to analyse the results of a probabilistic reconstruction method. In essence, these two skill measures compare the reconstruction over the validation period to the mean climatology of the calibration (RE) and validation (CE) period (Lorenz, 1956;Briffa et al., 1988).
As introduced by Tipton et al. (2016), in order to generate a similar statistic, the mean and standard deviation over the validation and calibration intervals for each location with instrumental data are calculated. These are then used to generate an ensemble of time series. They act as simple surrogates for the calibration and validation interval climatologies, which are then compared against the target instrumental data of the validation period, using the CRPS pot . Should this value be lower than the CRPS pot comparing the actual reconstruction ensemble against the instrumental data, the reconstruction does not add skill over the climatology. Thus, subtracting the CRPS pot of the reconstruction from the CRPS pot of the surrogates results in measures that indicate a skilful reconstruction if they are positive, i.e. a reconstruction that performs better than the climatology over the calibration (validation) interval. We denote these two scores as CRPS RE and CRPS CE . The bottom row in Fig. A1 shows that about half of the grid cells with instrumental data have a CRPS RE and CRPS CE that is above zero -and these grid cells are mostly those that have the longest instrumental time series (inside and outside the calibration interval). Thus, these results not only reflect a possibly weak reconstruction but more likely the lack of actual instrumental data to construct any meaningful comparison statistics over the validation period.   Figure B1 presents the time changes in the spatially averaged intra-ensemble variance as a measure of the spread across the ensemble members. The variance shows a progressive decline over the pre-industrial reconstruction more pronounced in the confidence intervals (CIs) for the period 800-1000 CE (which is linked with the time of an expansion of the multiproxy network). Along with the intra-ensemble variability, a progressive increase in the proxy data density over time contributes to the observed decrease in the ensemble spread. The introduction of the instrumental data into the scheme (corresponding to a calibration period in the regular climate reconstruction language) causes a sharp drop in the spread after 1850 CE that reaches a minimum around 1950 CE, a period of the maximal instrumental data coverage. Figure B2 further illustrates the effects of the spatial changes in input data density on the reconstruction intra-ensemble spread. The figure presents intra-ensemble spatial variances averaged over four time periods. The selected time slices are associated with periods of distinctly different proxy and calibration data density: part of the Roman Warm Period (200-300 CE) with a climate field reconstruction based on eight proxy records only, one of the coldest periods of the LIA (1600-1700 CE) with a complete multi-proxy network, and parts of the calibration period of 1850-1900CE and 1950-1980 CE, representative of the low and high instrumental data coverages, respectively.

Appendix B: Intra-ensemble variance of the reconstruction
Year  Figure B2. Time-averaged intra-ensemble variance of the Arctic2k reconstruction shown for the four sub-periods with a distinct difference in proxy data density (200-300 CE vs. 1600-1700 CE, panels a and b) and calibration sub-periods with different instrumental data coverage (1850( -1900( vs. 1950( -1980. Black dots show the proxy locations with at least one data point over the period of averaging.

Appendix C: Statistical properties of the reconstruction and input data
As an additional test for the reliability of the proxy series and the validity of the climate field reconstruction, the temporal persistence of both need to be analysed. The Kolmogorov-Smirnov test is first used to test the normality of the proxy records and the climate field reconstruction, with a significance level p = 0.05, and additionally Q-Q plots are checked. Then, the power spectral density (PSD) is used to study the variability at different frequencies for the records, using the periodogram as an estimator of the PSD. The periodogram is defined here in terms of the discrete Fourier transform H m as S(f m ) = (2/N)|H m | 2 , m = 1, 2, . . ., N/2. The sampling time is the time unit (here: years), and the frequency is measured in cycles per time unit: f m = m/N. f = 1/N is the frequency resolution and the smallest frequency which can be represented in the spectrum, while f N/2 = 1/2 is the Nyquist frequency.
The characteristic shape of the spectrum provides useful information about the temporal persistence or memory of the underlying process. If the data are close to Gaussian and monofractal, the second-order statistics are sufficient to describe the statistical properties of the data. The spectral shape can then be associated with well-known stochastic processes. If the spectrum has a power-law shape, the process exhibits long-range memory (LRM). The strength of memory in an LRM stochastic process is described by the spectral exponent β, which can be estimated by a linear fit to the power spectrum; log S(f ) = −β log f + c. If the spectrum is Lorentzian (power law on high frequencies, flat on low frequencies), the underlying process is closer to an AR(1) process. In all spectral analyses, the fitting is applied to log-binned periodograms to ensure that all timescales are weighted equally. If the Gaussianity and monofractality criteria are not met, there could be underlying structures such as intermittency that are not captured by the analyses. In the temperature time series considered here, deviations from normality are due to nonlinear dynamics associated with volcanic eruptions, for example.

C1 Spectral analyses of the proxy records
The six proxy records originating from lake sediments deviate substantially from a Gaussian distribution and thus had to be transformed before analysis. Afterwards, around 60 % of the individual proxy records are Gaussian according to the Q-Q plots and the p values from the Kolmogorov-Smirnov test.
The characteristic shape of the spectra for all of the proxy records are classified into three spectral categories: (1) AR(1) processes, (2) persistent power-law processes with spectral exponent 0 < β < 1, and (3) records exhibiting weak persistence on high frequencies and increased levels of variability in frequencies corresponding to timescales longer than decadal-centennial. Figure C1 illustrates the spatial distribution of the proxy records with proxy type indicated by shape, and categories with colours. The Greenland records are similar to either an AR(1) or an LRM process. The Greenland LRM records are in fact only weakly persistent, with a spectral exponent 0 < β < 0.3. There is thus little evidence of long-term cooling due to orbital forcing from these records.
Along with a few tree ring records, the Greenland ice core records are the longest records used for the present reconstruction. As the low-frequency variability in these records dominates the reconstructed long-term variability, the resulting reconstructions do exhibit similarly low variability at long timescales. The proxies of category 3 are mainly tree ring records, widely distributed along the reconstruction region. These records may require additional attention in future studies, as the level of high-versus low-frequency variability is unusual compared to other proxy records and also instrumental measurements. Similar spectral characteristics were obtained for other tree ring chronologies in Franke et al. (2013), Zhang et al. (2015), Esper et al. (2015) and Büntgen et al. (2015). The persistence in a number of millennium-long climate model simulations and proxy-based temperature reconstructions has been studied in Østvand et al. (2014) and Nilsen et al. (2016) using the power spectrum along with selected other techniques. In these studies, LRM was detected in all records up to centennial-millennial timescales.

C2 Spectral analyses of climate field reconstruction
The resulting p values from the Kolmogorov-Smirnov test indicate that for individual locations of the field reconstruction, about 60-80 % of the ensemble members are Gaussian. For each ensemble member of the reconstruction, and each location, a spectral analysis is performed. Then, the persistence is analysed for ensemble-averaged spectra of each location. The analyses indicate that the reconstructed temperature is best described as an AR(1) process in time at almost all grid cells. This is not surprising, as the longest proxy records exhibit low levels of long-term variability, and the BARCAST reconstruction technique assumes an AR(1) model for the temporal evolution of the temperature. Further details about the characteristic transition times are obtained by making a least-squares fit of a bilinear continuous function for the spectrum. The detected break is located where the two lines intersect. The coloured map in Fig. C2 shows the spatial distribution of the found transition timescales; black dots indicate that the difference between the spectral exponents for low and high frequencies is more than 1. The spatial coherence indicates that BARCAST performs well when extrapolating temperatures to locations where observations are unavailable. For most of the area we find a marked transition in the spectral slope (black symbols). Only the east coast of Greenland and the Scandinavian sector have slightly less difference between the high-and low-frequency variability; that is, the spectral exponent does not change much between the two identified scaling regimes. This indicates more similarity to an LRM process. Additionally, the transition timescale is above 100 years for a number of single locations. There, the reconstruction is indeed closer to an LRM process than an AR(1) process.  Figure C3. The mean of the posterior SNR for all proxy series. Note the logarithmic scale. There is one series with a negative β 1 (grey).

C3 Proxy response
The BARCAST output parameters contain information on the proxy signal strength (β 1 in Eq. 1b) and proxy noise level (τ 2 P , Eq. 1b). Under the assumption of a unit standard deviation climate variable, the ratio of β 1 /τ P returns an estimate of the signal-to-noise ratio (SNR, in amplitudes) of the individual proxy series. The mean of all ensemble draws is shown in Fig. C3. Note that one proxy series (Finnish Lakeland) has a negative (inverted) response (β 1 ≈ −0.38 ± 0.7).
In general, most tree ring series have a SNR > 1, and they seem to be the highest-quality proxies on average, followed by the ice core data. The lake sediments add only some skill, but are important especially in regions with no other data present, such as eastern North America. This might be caused by the in general higher dating uncertainties (see discussion in Werner and Tingley, 2015) or a response on different timescales. This underlines the necessity to really use multiple proxies and to further improve the reconstruction methods to make use of information on different timescales.

Timescale modelling
In order to account for possible chronological uncertainties in the annually resolved proxy records, the technique of Comboul et al. (2014) is applied to the proxies with layer-counted timescales for the generation of ensembles of chronologies. BAM (banded archive modelling) simulates the timescale counting procedure as a superposition of two cumulative Poisson processes with age perturbations associated with two categories of errors: either miss (type 1) or doublecount (type 2) of an annual layer. More specifically, for each measurement x i assigned a time t i with i ∈ {1, . . ., n}, and a neighbouring x i+1 with t i+1 , i ∈ {2, . . ., m}, the vector of time increments δ, t i+1 − t i = δ i comprises two independent stochastic processes P 1 and P 1 , with parameters 1 and 2 , representing the rates of missing and doubly counted annual layers, respectively. We note that the approach implicitly assumes the independence of the two stochastic processes and depth (time) invariance of the error rates.
For the proxy series with chronologies constructed using a combination of annual layer counting and time markers (tie points) t k , k ∈ {1, . . ., K}, such as volcanic sulfate peaks or tephras with ages known to a specific precision (σ k ), a twostep procedure was implemented. The first step involved an Markov chain Monte Carlo simulation of M perturbed sets of tie points t m k following Divine et al. (2012), where [ • ] stands for rounding the argument to the nearest integer. For each particular set m of perturbed tie points and a time interval t m k , t m k+1 , k ∈ {1, . . ., K − 1} between the perturbed pairs of tie points timescale modelling was applied, and only those that satisfied a criterion of δ i = t m k+1 − t m k were retained for further analysis. For ages older than t m K a model with a free boundary was used instead. In total M = 1000 timescales t m i per proxy archive were generated. Using interpolation, the proxy series x i were further projected on the generated timescales t m i to yield the ensemble of proxy series with perturbed chronologies.
The error rates { 1 , 2 } were estimated for each particular proxy archive. In the framework the counting procedure is defined, for each point t i of the true unknown timescale the uncertainty of the modelled timescale follows the Skellam distribution with parameters {λ 1 , λ 2 } = {(t s − t i ) 1 , (t s − t i ) 2 }, where (t s − t i ) denotes the time lapse between t i and a counting start point t s (Comboul et al., 2014). For a symmetric error rate 1 = 2 and (t s − t i ) large enough, it converges to a normal distribution N (0, λ 1 + λ 2 ). The error rates can therefore be estimated as ˆ 1 ,ˆ 2 = argmin 1 , 2 ( δt max · ( 1 + 2 ) − t /4), (D1) where for a particular proxy archive δt max = argmax k (t k+1 − t k ), k ∈ {1, . . ., K − 1}, or the entire length of the chronology, and t denotes an estimated largest offset of the reported timescale from the unknown true timescale. For the majority of records we estimated the type 1 and type 2 error rates using the authors reports on the tie point used and uncertainty of the constructed chronologies. For the few archives in which the chronological uncertainties were not reported, a conservative estimate of [ 1 , 2 ] [0.05, 0.05] was assigned. Table D1 shows the list of proxy series together with parameters of the model used to simulate the annual layercounting process. In total ensembles of timescales for 13 annually dated records of the Arctic2k network, six ice core and seven lake sediment records, plus seven annually dated ice cores from the North Greenland Traverse from 1993 to 1995 (recently reanalysed by Weißbach et al., 2016), are generated. Table D1. List of the proxy records including the proxies of the Arctic2k network (PAGES 2k Consortium, 2017) with layer-counted timescales used in the present study together with parameters of the probabilistic model used in Monte Carlo simulations of the layer-counting process. The archives lacking information on the dating uncertainty are marked "*"; a conservative estimate of 1 , 2 [0.05, 0.05] was used in timescale modelling.