Implication of methodological uncertainties for mid-Holocene sea surface temperature reconstructions

We present and examine a multi-sensor global compilation of mid-Holocene (MH) sea surface temperatures (SST), based on Mg / Ca and alkenone palaeothermometry and reconstructions obtained using planktonic foraminifera and organic-walled dinoflagellate cyst census counts. We assess the uncertainties originating from using different methodologies and evaluate the potential of MH SST reconstructions as a benchmark for climate-model simulations. The comparison between different analytical approaches (time frame, baseline climate) shows that the choice of time window for the MH has a negligible effect on the reconstructed SST pattern, but the choice of baseline climate affects both the magnitude and spatial pattern of the reconstructed SSTs. Comparison of the SST reconstructions made using different sensors shows significant discrepancies at a regional scale, with uncertainties often exceeding the reconstructed SST anomaly. Apparent patterns in SST may largely be a reflection of the use of different sensors in different regions. Overall, the uncertainties associated with the SST reconstructions are generally larger than the MH anomalies. Thus, the SST data currently available cannot serve as a target for benchmarking model simulations. Further evaluations of potential subsurface and/or seasonal artifacts that may obscure the MH SST reconstructions are urgently needed to provide reliable benchmarks for model evaluations.


Introduction
The mid-Holocene (MH, 6 ± 0.5 ka BP, 4705-5755 14C yr BP, Reimer et al., 2009) is one of the three palaeoclimate experiments included in the fifth phase of the Coupled Model Intercomparison Project (CMIP5: Taylor et al., 2012), which uses palaeoclimate simulations as an opportunity to evaluate how well models can reproduce climate changes outside the range of the instrumental period (Braconnot et al., 2012; Schmidt et al., 2013). The choice of the MH capitalises on the fact that this period has been a major focus for data synthesis, model simulations and data-model comparisons within the Palaeoclimate Modelling Intercomparison Project (PMIP: http://pmip.lsce.ipsl.fr). The MH is the nearest period in the past with an ice-sheet extent similar to that of the present day but characterised by a large change in the seasonal and latitudinal distribution of insolation, leading to an enhanced seasonal cycle of temperature in the Northern Hemisphere (NH) and a reduced seasonal cycle in the Southern Hemisphere (SH) (Braconnot et al., 2007).
Terrestrial archives provide robust reconstructions of the spatial and seasonal patterns of MH land-based temperature and precipitation anomalies (Bartlein et al., 2011). Evaluations of the CMIP5 simulations using terrestrial MH reconstructions show that climate models reproduce the direction and large-scale spatial patterns of the seasonal reconstructions (Izumi et al., 2013; Li et al., 2013; Schmidt et al., 2013) but often fail to reproduce the observed magnitude of regional changes (Hargreaves et al., 2013; Harrison et al., 2013; Perez et al., 2014).
Sea surface temperature (SST) reconstructions have proved to be a valuable tool for evaluation of Last Glacial Maximum (LGM) simulations (Kageyama et al., 2006; Otto-Bliesner et al., 2009; Hargreaves et al., 2011; Wang et al., 2013), but their potential for evaluation of MH simulations still largely remains to be explored. There have been several attempts to reconstruct MH SSTs for specific regions (e.g. the North Atlantic: Kerwin et al., 1999; Ruddiman and Mix, 1993), but the Global database for alkenone-derived HOlocene Sea-surface Temperature (GHOST) data set of Mg / Ca and alkenone-based SSTs provides the only global product (Kim, 2004; Leduc et al., 2010). Data-model comparisons using the GHOST data set have shown significant mismatches between the modelled and reconstructed SST anomalies (Schneider et al., 2010; Hargreaves et al., 2013; Lohmann et al., 2013). It has been suggested that these mismatches could reflect analytical uncertainties and/or issues related to the ecology of the sensors (the organisms whose fossils record the ambient temperature), which may have resulted in changes in depth and/or seasonal habitat compared to the present day (Lohmann et al., 2013). Given that the reconstructed MH SST anomalies are generally small, compared for example to the changes registered at the LGM (MARGO Project Members, 2009), it is important to assess how such factors affect the precision of the reconstructions in order to determine whether a global multi-sensor synthesis of MH SSTs could be used for model benchmarking.
Here, we present a new compilation of SST reconstructions for the MH based on the alkenone unsaturation index, the Mg / Ca palaeothermometer, and temperatures obtained using statistical reconstruction techniques for organic-walled dinoflagellate cysts (dinocysts) and planktonic foraminifera. Since the Mg / Ca palaeothermometer, the alkenone unsaturation index, and census counts of planktonic foraminifera and dinocysts can be used to derive SSTs, and hence provide proxies for temperature, they are often referred to as palaeotemperature proxies. Because they provide a wider range of information than simply SST, we prefer to use the term "sensor". We assess the uncertainties originating from using different sensors and different reconstruction methodologies to evaluate the potential of MH SST reconstructions to benchmark climate-model simulations.

Data collection and quality control
We have compiled site-based SST reconstructions made using the alkenone unsaturation index, the Mg / Ca palaeothermometer, and statistical reconstruction techniques for dinocysts and planktonic foraminifera assemblages, covering all ocean basins (Supplement Table 1). This is the same set of sensors as used in the MARGO LGM synthesis (Kucera et al., 2005a), except that we do not include records based on diatom and radiolarian transfer functions because of lack of available harmonised data sets. Most of the Mg / Ca and alkenone reconstructions are from the GHOST database (Kim et al., 2004; Leduc et al., 2010), but additional Mg / Ca and alkenone records, and the census counts of planktonic foraminifera and dinocysts, were obtained from public archives (e.g. Pangaea, NOAA-NGDC World Data Center for Paleoclimatology) or provided by the original author.
The data set is a selection of the available records from each ocean basin. Only sites that met the following data quality criteria were included in the compilation:
1. The individual records have at least 10 data points between 0 and 10 ka BP, and at least one data point in the 5.5-6.5 ka BP time window.
2. The sedimentation rate is at least 2 cm per 1000 years to ensure that individual samples represent no more than the investigated 1000-year time window, assuming no impact of bioturbation.
3. The chronology was based on at least two radiocarbon dates or other stratigraphic markers within the interval between 0 and 10 ka BP.
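The three screening criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `Record` container and its field names are hypothetical, and it assumes the sedimentation rate and the number of age-control points within 0-10 ka BP have already been determined for each core.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical container for one site record (field names are illustrative)."""
    sample_ages_ka: List[float]   # sample ages in ka BP
    sed_rate_cm_per_kyr: float    # mean sedimentation rate, cm per 1000 years
    n_age_control_points: int     # radiocarbon dates or other markers within 0-10 ka BP


def passes_quality_criteria(rec: Record) -> bool:
    """Apply the three data quality criteria listed in the text."""
    holocene = [a for a in rec.sample_ages_ka if 0.0 <= a <= 10.0]
    in_mh_window = [a for a in holocene if 5.5 <= a <= 6.5]
    enough_samples = len(holocene) >= 10 and len(in_mh_window) >= 1   # criterion 1
    fast_enough = rec.sed_rate_cm_per_kyr >= 2.0                      # criterion 2
    well_dated = rec.n_age_control_points >= 2                        # criterion 3
    return enough_samples and fast_enough and well_dated
```

A record with ten samples spread over 0-9 ka BP, a sedimentation rate of 5 cm per 1000 years, and three radiocarbon dates would pass; the same record with a rate of 1.5 cm per 1000 years would be rejected.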
We generated new SST reconstructions based on assemblage counts for planktonic foraminifera and dinocysts, using the methods adopted by the MARGO project for the LGM (de Vernal et al., 2005; Kucera et al., 2005b). This was necessary because transfer-function reconstructions were not available for some of the records or because existing transfer-function reconstructions were made using different calibration data sets. However, the Mg / Ca and alkenone palaeothermometry SST values were taken directly from the original publications. In the absence of objective guidelines for reinterpretation of the original measurements, this is the only possible approach.
Most of the individual site chronologies were based on radiocarbon dating. A very few sites have age models based on isotopic stratigraphy, specifically correlation of the benthic oxygen isotope record from the site with the standard SPECMAP composite record (Martinson et al., 1987), the Shackleton benthic oxygen isotope record (Shackleton, 2000), or the LR04 composite record of Lisiecki and Raymo (2005). The chronology of some cores was established by attributing ages to key stratigraphic events, such as sapropel events (e.g. Emeis et al., 2003). Since we only used records that met certain minimum requirements for chronological control, we had no reason to change the age models from the original publications. Therefore, we use the original chronology for each site, including a local reservoir correction if used in the original age model and without recalibrating the radiocarbon dates. In doing so, we rely on the assumption that differences between the different calibrations used in constructing the original age models are negligible over the Holocene age range.

Reconstructions based on planktonic foraminifera
The planktonic foraminifera census counts were initially screened for taxonomic consistency and counting method, and assessed for the effect of carbonate dissolution. Only records that passed this pre-screening were used for further statistical analysis. We did not identify any records from the Indian Ocean that were suitable. The data set therefore includes 57 planktonic foraminifera-based SST records (Supplement Table 1), with 14 from the North Atlantic, 2 from the equatorial South Atlantic, 15 from the Mediterranean Sea, and 26 from the Pacific. The average resolution across the MH interval is 4 samples per 1000 years, with a range of between 1 and 21 samples per core. The planktonic foraminifera census counts were converted into SST estimates using the multi-technique approach described by Kucera et al. (2005b). This approach is based on the simultaneous application of the Modern Analogue Technique (MAT) and the Artificial Neural Network (ANN) methods. The calibration data set was derived from the MARGO LGM project (Kucera et al., 2005b) and uses six regional calibrations against seasonal means of SST at 10 m water depth from the 1998 version of the World Ocean Atlas (WOA98: Conkright et al., 1998). The MAT approach searches the calibration data set for samples with assemblages that most resemble the fossil assemblage. We used the 10 best analogues, identified using the squared chord distance measure, in the Atlantic and Pacific, and the 5 best analogues in the Mediterranean Sea. The ANN method estimates SSTs by mapping the foraminifera census counts onto a highly recursive system of equations iteratively optimised on the training data. The ANN approach is mathematically entirely independent of MAT, e.g. by permitting extrapolation outside the range of parameter values in the calibration data set.
The final SST reconstructions represent the consensus between the two methods. At most of the sites, this is the average of the estimates obtained by the MAT and ANN methods. The calibration error of the foraminifera-based SST reconstructions is dependent on method and region, and ranges between ±0.8 and ±1.9 °C for winter, ±1.2 and ±1.6 °C for summer, and ±0.9 and ±1.7 °C for mean annual SST (Kucera et al., 2005a).
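The analogue search and the two-method consensus can be illustrated with a short sketch. It assumes that assemblages are expressed as relative abundances summing to 1, and that the SST of the k best analogues is combined as an unweighted mean; the weighting actually used in the MARGO implementation is not stated here, so this choice is an assumption. The ANN estimate is treated as an externally supplied number, and all function names are illustrative.

```python
import numpy as np


def squared_chord_distance(fossil, modern):
    """Squared chord distance between one fossil assemblage (n_species,)
    and each modern assemblage (n_sites, n_species), as relative abundances."""
    return np.sum((np.sqrt(modern) - np.sqrt(fossil)) ** 2, axis=1)


def mat_sst(fossil, modern_assemblages, modern_sst, k=10):
    """Modern Analogue Technique: mean SST of the k most similar modern samples.
    The unweighted mean is an assumption made for this sketch."""
    fossil = np.asarray(fossil, dtype=float)
    modern_assemblages = np.asarray(modern_assemblages, dtype=float)
    modern_sst = np.asarray(modern_sst, dtype=float)
    distances = squared_chord_distance(fossil, modern_assemblages)
    best = np.argsort(distances)[:k]          # indices of the k closest analogues
    return float(modern_sst[best].mean())


def consensus_sst(mat_estimate, ann_estimate):
    """Consensus taken as the average of the MAT and ANN estimates."""
    return 0.5 * (mat_estimate + ann_estimate)
```

Following the text, `k=10` would apply in the Atlantic and Pacific and `k=5` in the Mediterranean Sea.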

Reconstructions based on dinocysts
The data set includes 28 dinocyst-based SST records (Supplement Table 1), with 24 sites from the North Atlantic and 4 from the Mediterranean Sea. The average resolution across the MH interval is 6 samples per 1000 years, with a range of between 1 and 20 samples per core.
The dinocyst-based reconstructions were made using the MAT, as described in detail by de Vernal et al. (2005, 2013). The modern reference database includes 940 sites from the North Atlantic, North Pacific, Arctic Ocean, and adjacent epicontinental seas. The reference sites cover a wide range of environments, from cold to sub-tropical domains, neritic and open ocean conditions, and brackish to fully marine settings. Reconstruction uncertainties were calculated by retaining one-fifth of the data for verification independent of the original calibration. The reconstruction uncertainties of the dinocyst-based SST reconstructions are ±1.2 °C for winter, ±1.6 °C for summer, and ±1.1 °C for annual mean SSTs.

Reconstructions based on Mg / Ca thermometry
There are 38 Mg / Ca-based MH SST records in the data set (Supplement Table 1), with 19 records from the Pacific, 12 from the North Atlantic, 5 from the Indian Ocean and 2 from the South Atlantic. Most of these records came from the GHOST database (Leduc et al., 2010), but we excluded 3 GHOST records because they did not meet our quality criteria and added 9 records. The average resolution across the MH interval is 6 samples per 1000 years, with a range of between 1 and 24 samples per record.
The Mg / Ca temperatures are based on measurements on different planktonic foraminifera species at the different sites. Furthermore, the samples are prepared using different cleaning methods (Barker et al., 2003; Boyle and Rosenthal, 1996; Boyle and Keigwin, 1985; Boyle et al., 1995; Lea et al., 2000; Martin and Lea, 2002; Rosenthal et al., 1999), measured on different machines (ICP-OES, ICP-MS, Q-ICP-MS, flow-through ICP-MS), and calibrated using different equations (Anand et al., 2003; Barker and Elderfield, 2002; Dekens et al., 2002; Elderfield and Ganssen, 2000; Hastings et al., 2001; Mashiotta et al., 1999; Nürnberg et al., 1996; Rosenthal and Lohmann, 2002; Thornalley et al., 2009a; von Langen et al., 2005). Since we use the published reconstructions in our data set, the results could be affected by these differences. The impact of using different analytical methods was addressed in the inter-laboratory comparison studies of Rosenthal et al. (2004) and Greaves et al. (2008). In some cases, the SST reconstructions from different laboratories differed by as much as 3 °C. Inter-laboratory differences are dominated by different instrument calibrations (Greaves et al., 2008) and cleaning methods (Rosenthal et al., 2004). However, each laboratory uses specific SST calibrations, tailored to the taxa and treatment procedures they use, and thus the published temperature estimates are probably more comparable than these straight comparisons would suggest (Rosenthal et al., 2004).
The partial dissolution of foraminiferal calcite alters the Mg / Ca ratio of the shells, such that there is an increasing cold bias in reconstructed SST with increasing water depth (e.g. Regenberg et al., 2006). However, the basic relationship of Mg / Ca with temperature seems robust (Rosenthal et al., 2000). This means that corrections can be applied to compensate for the effect of dissolution, for example by using size-normalised shell weight as an index of dissolution (Rosenthal and Lohmann, 2002) or by applying a water depth correction such as in the calibration of Dekens et al. (2002). We further acknowledge a recent study reporting a more pessimistic scenario in which calcite dissolution may start occurring as shallow as 1000 m water depth in the Pacific Ocean and the Indonesian archipelago, i.e. where most of the Mg / Ca data come from (Regenberg et al., 2014). Since this potential problem was first addressed a long time ago (Russell et al., 1994), we here rely on the expertise of the original authors to have identified whether dissolution is a problem and to have applied a dissolution correction when necessary. Following Anand et al. (2003), we assume that the uncertainty on the estimation of the calcification temperature is ±1.2 °C. The temperature anomalies are calculated by subtracting each record's calcification temperatures from the modern ocean's SST at 10 m water depth obtained from the WOA98.

Reconstructions based on the alkenone unsaturation index
There are 89 alkenone-based MH SST records in the data set (Supplement Table 1), with 39 records from the Pacific, 26 from the North Atlantic, 6 from the Indian Ocean, 8 from the Mediterranean Sea, and 10 from the South Atlantic. The average resolution across the MH interval is 5 samples per 1000 years, with a range of between 1 and 33 samples per record. Most of the alkenone records have been obtained from the GHOST database (Kim, 2004; Leduc et al., 2010). We excluded 11 of the GHOST records because they did not meet our quality criteria and added 9 new records. Rosell-Melé et al. (2001) examined the analytical precision and reproducibility of alkenone-based temperature estimates generated by different laboratories, and found that inter-laboratory differences were on average ±1.6 °C. The original alkenone-derived temperature estimates were converted into SSTs using several different calibrations (Conte et al., 2006; Müller et al., 1998; Pelejero et al., 1999; Prahl et al., 1988; Prahl and Wakeham, 1987; Rosell-Melé et al., 1995; Sonzogni et al., 1998). A single calibration could be applied for most paleoceanographic settings (Conte et al., 2006), so the use of several different calibrations may introduce a systematic bias (Prahl et al., 2006). However, the calibrations are relatively similar for the intermediate range of temperatures observed in the global ocean, and this issue is only likely to be important under extreme conditions. The global average mean standard calibration error is ±1.2 °C, but larger deviations have been observed in upwelling zones and in the Arabian Sea (Conte et al., 2006).

Defining the "sea surface"
The "sea surface" and its related "sea surface temperature" have been set to 10 m depth following the decision by MARGO (Kucera et al., 2005a). This decision reflects a compromise allowing a harmonisation of SST estimates among different sensors. This choice does not mean that the authors assumed that all sensors record temperature at that depth. Rather, the decision reflects an assumption that all sensors and proxies record an SST signal which is highly correlated to SST at 10 m depth and that it is therefore possible to calibrate the individual proxies against SST at that depth. In the context of this study, where the focus lies on SST anomalies, the principal assumptions of this depth-homogenisation are thus that the SST recorded by each proxy and sensor is highly correlated to SST at 10 m depth and that this relationship remained the same between the 6 ka BP time slice and the present day. Whereas the SST depth recorded by phytoplankton sensors is limited to the photic zone, the depth range of species of planktonic foraminifera can be broader. The foraminifera-based Mg / Ca SST estimates are based chiefly on symbiont-bearing species with shallow habitats, whose calcification depth has been constrained to lie within the top 100 m of the water column (e.g. Anand et al., 2003; Regenberg et al., 2009). In contrast, the foraminifera-based transfer-function SSTs are based on analysis of the entire assemblage and, as shown by Telford et al. (2013), it is possible that assemblage composition is sensitive to subsurface temperature, particularly in low-latitude regions. This depth mismatch may be significant when reconstructing temperature of the Last Glacial Maximum, but it remains unclear whether it also has an effect on early Holocene SST estimates. Thus, in the absence of a universally applicable set of criteria for assigning depth to SST estimates by different proxies and sensors, we retained the 10 m depth definition used by MARGO, but we acknowledge that depth-misattribution of the reconstructed SST may be an additional source of uncertainty and may account for mismatch among SST proxies, particularly those based on planktonic foraminifera as a sensor.

The global data set
The final data set (Supplement Table 1) consists of 212 individual SST records, of which 89 are based on alkenones, 38 on Mg / Ca, 57 on planktonic foraminifera, and 28 on dinocysts. The planktonic foraminifera and dinocysts provide mean annual, summer, and winter reconstructions, but the Mg / Ca records are only used for summer and the alkenones for mean annual SSTs as recommended by the MARGO LGM group (Kucera et al., 2005a).
Assigning SST records based on alkenones to mean-annual SST and Mg / Ca to summer SST inevitably leads to shortcomings concerning the interpretation of palaeothermometers. Alkenone-producing coccolithophores have preferential blooming seasons varying on the basis of regional hydrological and climatological patterns. Schneider et al. (2010) have used satellite observations to compute a seasonality index that depicts where and when primary productivity is increased with respect to the annual SST cycle. They showed that primary producers, a generic term that includes the alkenone-synthesising coccolithophores, preferentially thrive during summer at high latitudes because of light limitation, and during winter at low latitudes when increased surface ocean mixing brings nutrients to the photic zone (Schneider et al., 2010). Different foraminifera species also occupy different ecological niches, and their representative season may vary from place to place, depending on the species analysed downcore (see e.g. Lombard et al., 2011). Assigning foraminifera-derived SST to summer temperatures and alkenone-derived SST to mean-annual temperatures hence provides an overly simplified template for SST databases, and much progress must be made in this direction to reduce the uncertainties associated with an SST database derived from multiple sensors.
We calculate MH annual, summer, and winter SST anomalies by subtracting seasonal SST reconstructions from a modern seasonal reference climate. Winter is defined as January, February, and March in the NH and July, August, and September in the SH; summer as July, August, and September in the NH and January, February, and March in the SH. We follow the protocol established for the MARGO LGM reconstructions (Kucera et al., 2005a) by using WOA98 as a modern reference (Supplement Table 2), but we also explore the use of other potential reference climates (Sect. 3.1). The MH temperature at a site is the average of all measurements within the 5.5-6.5 ka BP window (Supplement Table 2), but we also examined the potential use of a smaller time window (Sect. 3.2). Although many of our analyses are based on reconstructions at individual core sites, we have also gridded the reconstructions on a regular 5 × 5° latitude/longitude grid by averaging all of the records for a given season.
The complete data set is available at www.pangaea.de (doi:10.1594/PANGAEA.830814; doi:10.1594/PANGAEA.830811). In addition to the data provided in the Supplement, it contains age model information of the previously unpublished records.
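The season definitions and the window averaging can be sketched as follows. This is an illustrative sketch under stated assumptions: sites at the equator are treated as NH, and the anomaly is taken as MH minus modern so that positive values mean a warmer MH (a sign convention assumed here to match the warmer/cooler-than-present descriptions of the maps). Function names are hypothetical.

```python
import numpy as np

# Month numbers (1-12) for the Northern Hemisphere season definitions in the text
NH_SEASONS = {"winter": (1, 2, 3), "summer": (7, 8, 9)}


def season_months(season, latitude):
    """Months defining 'winter' or 'summer' at a site; swapped in the SH.
    Treating latitude 0 as NH is an assumption of this sketch."""
    if latitude >= 0:
        return NH_SEASONS[season]
    opposite = "summer" if season == "winter" else "winter"
    return NH_SEASONS[opposite]


def mh_mean_sst(ages_ka, sst, window=(5.5, 6.5)):
    """Average of all SST values whose ages fall inside the MH window (ka BP)."""
    ages = np.asarray(ages_ka, dtype=float)
    sst = np.asarray(sst, dtype=float)
    inside = (ages >= window[0]) & (ages <= window[1])
    if not inside.any():
        raise ValueError("no samples inside the requested time window")
    return float(sst[inside].mean())


def sst_anomaly(mh_sst, modern_sst):
    """MH minus modern reference (assumed sign convention: positive = warmer MH)."""
    return mh_sst - modern_sst
```

For example, `mh_mean_sst([5.0, 5.6, 6.4, 7.0], [10, 12, 14, 20])` averages only the two samples dated 5.6 and 6.4 ka BP.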

Impact of the choice of baseline climate
The most robust way of comparing model outputs and palaeoclimate reconstructions is through the use of anomalies, the difference between a palaeoclimate reconstruction or experiment and a corresponding modern baseline observation or control experiment. In contrast to terrestrial environments, it is often difficult to obtain modern samples in the ocean. To reconstruct the change in SSTs at the LGM, MARGO used observed temperature at 10 m water depth from WOA98 as a modern reference temperature (MARGO Project Members, 2009). Other studies have used different baselines (Marcott et al., 2013; Ruddiman and Mix, 1993) or have calculated anomalies relative to a long-term average (e.g. the last 1000 years: Harrison et al., 2013; Leduc et al., 2010) derived from the core top sediments. To test the impact of the choice of baseline climate on the reconstructed SST anomaly patterns, we examined the effect of using the updated version of the World Ocean Atlas (WOA09; Locarnini et al., 2010) and the Hadley Centre Sea Ice and Sea Surface Temperature data set, which covers the period of 1900 to 2000 (HadISST; Rayner et al., 2003). We also examined the impact of using a long-term core-top average to calculate the anomalies, by comparing data from the GHOST database (which includes a "modern" reference based on the 1000-year core top average) with the anomalies from WOA98.
The average of the absolute difference in the MH mean annual SST anomalies based on WOA98 and WOA09 is 0.3 °C (Fig. 1a), while the average absolute difference between WOA98 and the HadISST data set is 0.4 °C (Fig. 1b). Differences in the reconstructed anomalies using different baselines exceed 1 °C in some areas (Mediterranean Sea, mid-latitude eastern Pacific). The differences in the MH anomalies estimated using the core top reconstructions as the modern reference compared to the WOA98 reference are even larger (Fig. 1c), with an average of the absolute difference of 2 °C, and again this affects the spatial pattern of the reconstructed SST anomalies. The impact on the spatial patterning is reflected in the frequency distributions of the anomalies relative to the different reference climates (Fig. 1d-f), which are different in terms of dispersion and skewness. The choice of baseline climate has an equally large impact on seasonal anomalies (Supplement Figs. 1, 2). Thus, the choice of baseline climate affects both the magnitude and the spatial pattern of reconstructed MH SST anomalies.
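The baseline comparison reduces to an average absolute difference between two sets of anomalies computed from the same MH values. The sketch below is illustrative; it assumes one MH value and one baseline value per site, with anomalies taken as MH minus baseline, and the function name is hypothetical.

```python
import numpy as np


def mean_abs_baseline_difference(mh_sst, baseline_a, baseline_b):
    """Average absolute difference between anomalies computed against two
    modern baselines (e.g. two climatologies sampled at the core sites)."""
    mh = np.asarray(mh_sst, dtype=float)
    anomaly_a = mh - np.asarray(baseline_a, dtype=float)
    anomaly_b = mh - np.asarray(baseline_b, dtype=float)
    return float(np.mean(np.abs(anomaly_a - anomaly_b)))
```

Because the same MH value enters both anomalies at a site, the site-level difference equals the difference between the two baselines themselves, so this statistic measures how strongly the reference climatologies disagree at the core locations.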

Impact of the choice of time frame
In developing synthetic data sets for data-model comparisons, the MH has conventionally been defined as 6.5 to 5.5 ka BP (Kohfeld and Harrison, 2000; Leduc et al., 2010; Prentice et al., 2000) with reconstructions being made based on all samples falling within this window. The use of average values within a specified time window prevents the selection of single samples that represent minor climate oscillations to compare with a simulation representing long-term average conditions, and also maximises the geographic coverage of sites. However, it assumes that short-term (inter-annual to inter-decadal) climate variability has a negligible impact on the long-term average signal. While this appears to be the case for land reconstructions (see e.g. Bartlein et al., 2011), this may not be true in the marine realm where the MH changes are smaller.
More than 80 % of the records in the data set have multiple samples falling in the conventional MH window, where the anomalies would therefore normally be estimated as the average of values from multiple samples. We tested the impact of choosing different sampling windows by examining the variability at individual sites with resolution of < 100 years (Fig. 2a and Supplement Figs. 3a, 4a) and also by comparing the results obtained by averaging over the 6.5 to 5.5 ka BP time window and by averaging over a shorter time period (6.25 to 5.75 ka BP) (Fig. 2b and Supplement Figs. 3b, 4b). These comparisons show that between-sample differences within the 1000-year window can be large (range of 1-3 °C), and the between-sample variability is not reduced when considering the 500-year window (range of 1-3 °C). There is no difference in the variability as a function of sample size between the broader and narrower time windows (Fig. 2c and Supplement Figs. 3c, 4c). As a result, the magnitudes and spatial patterns of the anomalies obtained using averages for 1000-year and for 500-year windows are similar. However, using the 500-year window would reduce the number of points represented on a synthetic map. While this means that the convention of defining the MH as 6.5 to 5.5 ka BP for data-model comparisons is acceptable, the considerable between-sample variability is problematic given that the expected changes in SSTs are small in most regions.
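The window comparison for a single record can be sketched as below. This is a minimal illustration, assuming that between-sample variability is summarised by the range (maximum minus minimum) of the SST values in the window; the function name and return format are hypothetical.

```python
import numpy as np


def window_stats(ages_ka, sst, window):
    """Sample count, mean, and range (max - min) of SST within a time window (ka BP)."""
    ages = np.asarray(ages_ka, dtype=float)
    values = np.asarray(sst, dtype=float)
    inside = values[(ages >= window[0]) & (ages <= window[1])]
    if inside.size == 0:
        return {"n": 0, "mean": float("nan"), "range": float("nan")}
    return {
        "n": int(inside.size),
        "mean": float(inside.mean()),
        "range": float(inside.max() - inside.min()),
    }


# Compare the conventional 1000-year window with the narrower 500-year window
ages = [5.6, 5.8, 6.0, 6.2, 6.4]
sst = [14.0, 15.0, 16.0, 15.0, 17.0]
wide = window_stats(ages, sst, (5.5, 6.5))
narrow = window_stats(ages, sst, (5.75, 6.25))
```

In this made-up record the narrower window retains three of the five samples, which illustrates why a 500-year window reduces the number of sites that can be mapped.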

Sensor comparison
The use of multiple sensors increases the number of data points available to reconstruct global SST patterns, but raises the issue of the comparability of reconstructions from different sensors. There are only 21 (out of a total of 212) records in the data set where reconstructions from two sensors are available. It is difficult to see any consistent relationship between the reconstructions made with different sensors at the same site. For example, although reconstructions based on foraminifera consistently yield colder mean annual temperatures than reconstructions based on alkenones, the difference can be negligible at some sites and several °C at others (Fig. 3). In the seasonal reconstructions, even the sign of the offset between sensors is inconsistent (e.g. dinocyst reconstructions show conditions both colder and warmer than the corresponding foraminifera-based reconstructions (Fig. 6)).
However, there is an insufficient number of points, overall and for any one season, to make site comparisons meaningful. We therefore compare the individual sensor reconstructions by season for specific ocean regions, using only regions where there are at least three records for a given sensor. The different sensors give comparable estimates of the median change in annual SSTs (taking into account the uncertainty range) in most of the regions, except in the North Atlantic, where alkenone-based reconstructions indicate much warmer temperature anomalies than either foraminifera or dinocysts (Fig. 4). This discrepancy is most marked in comparisons where the median is calculated from all of the individual samples within the 1000-year window between 6.5 and 5.5 ka BP from each record (Fig. 4a), but the difference between alkenone-based and foraminifera-based reconstructions is still outside the range of uncertainties when the median is estimated from the average MH SST anomaly of each of the individual records (Fig. 4b). Although summer reconstructions from different sensors give similar estimates (Fig. 4a, b), the median change in the South Atlantic estimated from foraminifera and Mg / Ca are significantly different, with Mg / Ca reconstructions indicating very large cooling (Fig. 4a, b). Even in cases where the median estimates are similar across all sensors (within the range of uncertainty), the between-sample and between-site variability in SST can be very large. In the Pacific, for example, where the median values obtained from alkenones and foraminifera for both mean annual and summer anomalies are similar, the interquartile range based on all the samples is ca. 3 °C and the full range is ca. 10 °C (ca. 7 °C when only the record averages are used). Similarly large differences between sensors, and variability, can be seen along latitudinal transects within specific regions (Supplement Fig. 5).
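The regional sensor comparison amounts to grouping anomalies by region and sensor, discarding groups with fewer than three records, and summarising each remaining group by its median and spread. The following sketch is illustrative only; the input layout (one tuple per record average) and the function name are assumptions.

```python
from collections import defaultdict

import numpy as np


def regional_summary(records, min_records=3):
    """Median and interquartile range of anomalies per (region, sensor) group.
    `records` is an iterable of (region, sensor, anomaly) tuples; groups with
    fewer than `min_records` entries are skipped."""
    groups = defaultdict(list)
    for region, sensor, anomaly in records:
        groups[(region, sensor)].append(float(anomaly))
    summary = {}
    for key, values in groups.items():
        if len(values) < min_records:
            continue
        q25, median, q75 = np.percentile(values, [25, 50, 75])
        summary[key] = {
            "n": len(values),
            "median": float(median),
            "iqr": float(q75 - q25),
        }
    return summary
```

The same function could be applied to all individual samples instead of record averages by passing one tuple per sample, which corresponds to the two variants compared in the text.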

Regional sea surface temperature pattern
It is common practice to grid individual site-based reconstructions (e.g. MARGO Project Members, 2009; Bartlein et al., 2011; Annan and Hargreaves, 2013; Harrison et al., 2013) to facilitate comparison with gridded climate-model outputs. We derived gridded estimates of summer, winter, and mean annual MH SST anomalies by averaging values from every sample from every record within a 5 × 5° latitude/longitude grid. We estimated the standard deviation (SD) for each grid cell based on all values in the grid cell. The data set yields values for 122 grid cells (Supplement Table 3), with grid cell values being based in some cases on a single sample from a single record and in other cases multiple samples from between one and nine records.
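The gridding step can be sketched as a simple binning of sample anomalies. This is a minimal illustration under stated assumptions: grid cell edges are placed at multiples of 5° (the actual grid origin is not given in the text), and the SD is computed as the sample standard deviation, which is undefined for cells containing a single value. Function names are hypothetical.

```python
import math
from collections import defaultdict

import numpy as np


def grid_cell(lat, lon, size=5.0):
    """South-west corner of the grid cell containing a site
    (cell edges at multiples of `size` degrees, an assumption of this sketch)."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def grid_anomalies(samples, size=5.0):
    """Mean and SD of all sample anomalies per grid cell.
    `samples` is an iterable of (lat, lon, anomaly) tuples."""
    cells = defaultdict(list)
    for lat, lon, anomaly in samples:
        cells[grid_cell(lat, lon, size)].append(float(anomaly))
    gridded = {}
    for cell, values in cells.items():
        sd = float(np.std(values, ddof=1)) if len(values) > 1 else float("nan")
        gridded[cell] = {"n": len(values), "mean": float(np.mean(values)), "sd": sd}
    return gridded
```

Cells that return `n == 1` correspond to the case noted in the text where a grid value rests on a single sample from a single record.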
The gridded maps (Fig. 5) suggest that annual mean SSTs in the mid- to high-latitude NH and mid-latitude SH were warmer than in the present (Fig. 5b). The upwelling cells off southwest Africa and off Chile display annual mean conditions warmer than today, with the signal being more pronounced in the eastern South Atlantic. In contrast, mean annual SSTs in the tropics appear to be cooler than today. The reconstructed summer SSTs (Fig. 5c) are cooler than today everywhere except the high-latitude Arctic Ocean. In winter, the signal in the North Atlantic is spatially variable, but there is a contrast between warmer-than-present SSTs in the eastern Pacific Ocean and cooler-than-present SSTs in the western Pacific (Fig. 5d). However, consistent with the results shown for individual ocean basins (Fig. 4), the maps suggest that the overall change in SSTs is small (average of gridded annual mean = 0.54 °C, summer = −1.01 °C, winter = −0.13 °C), with high inter-site variability.
[Figure caption fragment: Only sensors that are represented by a minimum of three data points in any basin are plotted. The box-and-whisker plots were drawn using the Golden Software Grapher, which applies Tukey's method showing the 5 to 95 percentiles. The line shows the median. The whiskers are drawn down to the 5th percentile and up to the 95th. Points below and above the whiskers are drawn as individual dots. Outliers are calculated as the 75th (25th) percentile plus (minus) 1.5 × IQR (interquartile range). If this value is greater than (smaller than or equal to) the largest value in the data set, the upper whisker is drawn to the largest value. Any points greater (smaller) than the 75th (25th) percentile plus (minus) 1.5 × IQR are plotted as individual points. The chance of finding an "outlier" by Tukey's rule in data sampled from a Gaussian distribution depends on sample size.]

Assessment of significance of reconstructed changes in sea surface temperatures
We assess the significance of the reconstructed changes in SST by comparing the magnitude of the anomalies with the standard error (SE), based on sites with at least three samples in the 6.5 to 5.5 ka BP window. We assume that a reconstructed change is significant when it exceeds twice the SE after taking into account the measurement or calibration uncertainties associated with the sensor on which the measurements were performed (Fig. 6). Most of the reconstructions, both for individual site records (Fig. 6a) and for gridded reconstructions (Fig. 6b), do not show significant changes in SST. Specifically, we find that only 34 % of the site-based reconstructions and 33 % of the gridded reconstructions of mean annual SST are significant; 28 % of the site-based reconstructions and 33 % of the gridded reconstructions of summer SST are significant; 29 % of the site-based reconstructions and 16 % of the gridded reconstructions of winter SST are significant. Furthermore, more than 75 % of the gridded reconstructions are based on single records. If we consider only those grid cells where the reconstruction is based on multiple core records (as well as multiple samples from each core), then only one grid cell shows significant seasonal or mean annual anomalies (Fig. 6c). Although we assume all uncertainties are independent, a certain level of dependency may exist nonetheless. However, considering the uncertainties as dependent would lead to the t test identifying even fewer records as being significant.
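One possible reading of this criterion is sketched below. The text does not specify exactly how the calibration uncertainty is combined with the standard error, so the rule used here (the absolute mean anomaly, reduced by the calibration uncertainty, must exceed twice the SE) is an assumption, as are the function name and the sample values. The ±1.2 °C figure is the mean annual alkenone uncertainty quoted in the caption of Fig. 6.

```python
import math
from statistics import mean, stdev


def is_significant(anomalies, calibration_uncertainty):
    """Return True if the mean anomaly passes the (assumed) 2 x SE test.

    anomalies: SST anomalies (deg C) of the samples at one site or grid cell.
    calibration_uncertainty: sensor measurement/calibration uncertainty (deg C).
    """
    if len(anomalies) < 3:
        raise ValueError("at least three samples are required")
    # Standard error of the mean from the sample standard deviation.
    se = stdev(anomalies) / math.sqrt(len(anomalies))
    # Assumption: the calibration uncertainty is subtracted from the
    # absolute anomaly before comparing with twice the standard error.
    return abs(mean(anomalies)) - calibration_uncertainty > 2.0 * se


# Invented example values with the alkenone mean annual uncertainty (1.2 deg C).
strong_signal = is_significant([2.0, 2.2, 1.8], calibration_uncertainty=1.2)
weak_signal = is_significant([0.4, 0.9, -0.2], calibration_uncertainty=1.2)
```

Under this reading, an anomaly smaller than the calibration uncertainty can never be significant, however many samples are available, which is in keeping with the small share of significant reconstructions reported above.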

Reliability assessment
In the absence of independent evidence, there is no objective way to assess the reliability of the gridded SST patterns. MARGO Project Members (2009) established a semi-empirical method to assess the uncertainty on individual LGM reconstructions. This method combines the calibration error and measurement uncertainty for each sensor with an arbitrary measure of confidence in the estimate and a semi-quantitative assessment of uncertainty due to dating and internal variance, based on the number of samples per core lying in the specified time window and the quality of the age model of each record. This is then combined with the variability of the SST reconstructions within a grid cell to provide an assessment of the overall reliability of the gridded reconstructions. Using the same approach, and considering the SST signal to be reliable when the reconstructed SST anomaly is at least twice as large as the weighted uncertainty, only 1 % of the mean annual, 4 % of the winter, and none of the summer SST reconstructions can be considered as reliable. The low number of grid cells considered as having reliable reconstructions casts further doubt on many of the features shown in the mapped reconstruction.

Impact of sensor distribution on mapped sea surface temperature patterns
There are regional patterns in the distribution of records derived from particular sensors (Fig. 5a). Given the discrepancies between reconstructions obtained with different sensors (Sect. 3.3), this raises the issue of whether patterns in reconstructed SSTs (Sect. 3.4) are an artifact of the distribution of sensors. For example, the east-west dipole in the Pacific during summer is based on planktonic foraminifera in the eastern and Mg / Ca SSTs in the western part of the basin.
Similarly, some of the noisiness apparent in regional reconstructions (e.g. in the mid- to high-latitude North Atlantic) clearly reflects adjacent sites where the records were derived from different sensors. Some patterns are entirely based on a single type of sensor and could be less apparent if other types of record were available. For example, the pattern of MH summer warming in the western Arctic is entirely based on dinocyst reconstructions, while the cooling in mean annual temperature in the Indian Ocean is derived only from alkenone reconstructions.

Discussion
There have been several attempts to produce regional and/or global SST syntheses for the MH (Kerwin et al., 1999; Leduc et al., 2010; Ruddiman and Mix, 1993). Most of these have been based on one or (at best) two types of sensors and have used different baseline climates for the calculation of anomalies, and are thus difficult to combine or compare. Here we have followed the MARGO LGM multi-sensor approach (MARGO Project Members, 2009) to produce a data set of MH SST anomalies. The reconstructed changes in SSTs are small and rarely exceed the uncertainties of the measurements or the between-sample and between-site variability for a single sensor. Given that differences between the measurements obtained from different sensors are also large, and that only 9 % of the available cores have measurements on more than one sensor, the patterns that emerge from the gridded maps are probably methodological artifacts.
The MH is a key period for climate model evaluation (Braconnot et al., 2012). Evaluations of the CMIP5 palaeosimulations indicate that the coupled ocean-atmosphere models are able to capture the very-large-scale pattern of climate change, and have some limited success in capturing different spatial patterns over the continents during the MH (Izumi et al., 2013; Li et al., 2013; Schmidt et al., 2013). However, evaluations using various different SST compilations, largely based on Mg / Ca and alkenone data, have shown there are significant mismatches between simulated and reconstructed SST (Schneider et al., 2010; Lohmann et al., 2013; Mairesse et al., 2013). Our evaluation of the large uncertainties associated with the MH SST reconstructions suggests that these mismatches may equally well reflect data uncertainty as model inadequacy.
Standardisation of laboratory techniques and/or calibrations could remove a large part of the between-site variability in SST reconstructions from an individual sensor. Rosenthal et al. (2004) have shown that the use of different cleaning methods introduces a bias of ±1 °C, while the use of different calibrations introduces differences of ±0.5 °C for Mg / Ca reconstructions. Similar problems affect the comparability of alkenone-based SST reconstructions and may be responsible for even larger differences between individual reconstructions (Rosell-Melé et al., 2001).
We have shown that the choice of baseline climate introduces uncertainty in both the magnitude and the spatial patterns of the SST reconstructions. Standardisation of the choice of baseline climate, as advocated by the MARGO LGM project (Kucera et al., 2005a), will remove one source of potential differences between different SST data sets. However, this does not mean that the resultant data set will be any more comparable to model simulations. There has been minimal consideration of whether reconstructed palaeoclimate anomalies are strictly equivalent to simulated anomalies, but our analyses show that the choice of a "modern" climate is crucial when the climate-change signal is small. Because MH SST anomalies depend on the baseline climate chosen, it may prove inadequate to use pre-industrial climates as the reference state in MH model simulations.
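The sensitivity to the baseline follows directly from the definition of an anomaly as the reconstructed SST minus the modern reference SST. The short sketch below illustrates the arithmetic for a single site; all temperatures are invented for illustration and are not taken from WOA98, WOA09, or HADiSST, whose names are used only as labels.

```python
def sst_anomaly(reconstructed_sst, baseline_sst):
    """Anomaly (deg C) = reconstructed MH SST minus modern reference SST."""
    return reconstructed_sst - baseline_sst


# Hypothetical values for one site, invented for illustration only.
reconstructed_mh_sst = 18.4
modern_baselines = {"WOA98": 18.0, "WOA09": 18.3, "HADiSST": 17.7}

# Same reconstruction, three different "modern" reference climates.
anomalies = {
    name: sst_anomaly(reconstructed_mh_sst, baseline)
    for name, baseline in modern_baselines.items()
}

# The spread among the anomalies equals the spread among the baselines.
spread = max(anomalies.values()) - min(anomalies.values())
```

With these invented numbers the anomaly ranges from about 0.1 to 0.7 °C depending only on the reference data set, a spread comparable to the size of the MH signal itself.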
The MH orbital configuration resulted in a seasonal cycle of insolation that is different from today and therefore should have had a larger impact on seasonal than mean annual SSTs. Thus, reconstructions of seasonal SSTs are likely to be more useful for model evaluation than reconstructions of mean annual SSTs. We followed the same approach as the MARGO project (Kucera et al., 2005a) to assign alkenone-based and Mg / Ca-based SSTs to specific seasons: Mg / Ca-based SST reconstructions were assumed to provide summer temperature estimates and alkenones to provide estimates of mean annual temperature. These seasonal assignments are pragmatic, but Lohmann et al. (2013) have shown that it is possible to minimise apparent mismatches between simulated and reconstructed MH SSTs by accounting for possible shifts in the seasonality of plankton blooms or in the depth at which the plankton lived. The empirical evidence for seasonal representation is equivocal. Ecological considerations suggest most phytoplankton species bloom in the warmer part of the year and this will also be reflected in the abundance of the organisms that graze on them (e.g. Mohtadi et al., 2009; Wilke et al., 2009; Žarić et al., 2005). However, the Mg / Ca-based temperature signal is based on measurements from different planktonic foraminifera species, which potentially represent SSTs in different depth habitats of the ocean surface and/or seasons. Indeed, Mg / Ca-based SSTs have been interpreted as reflecting annual (e.g. Came et al., 2007; Eggins et al., 2003; Steinke et al., 2011) or seasonal SSTs (Hessler et al., 2011; Mohtadi et al., 2009; Steinke et al., 2008), depending on location, or as reflecting the season of upwelling in coastal regions (Farmer et al., 2008). Similarly, it has been suggested that the alkenone records represent warm-season SSTs in high latitudes and the cold season in low latitudes (Leduc et al., 2010; Schneider et al., 2010). However, Rosell-Melé and Prahl (2013) showed that there is no consistent and globally applicable seasonal pattern apparent in the alkenone flux to sediment.
The use of statistical reconstruction techniques, applied here to reconstruct summer and winter SSTs from planktonic foraminifera census counts and dinocysts, does not solve the problem. The derived seasonal SST reconstructions are not independent but necessarily reflect the covariance among seasonal SSTs in the modern ocean (Kucera et al., 2005a). That this covariance held unchanged is patently unlikely in the case of the MH, and model analyses suggest that there were significant changes in seasonality even under LGM conditions (Izumi et al., 2013). As indicated in Sect. 2.3 (defining the "sea surface"), the SST pattern reconstructed in this study is also likely biased by sensitivity of planktonic foraminifera assemblages to temperatures at different depths in the water column, as well as by changes in the SST sensitivity or recording depth of the other sensors and proxies between the present day and the early Holocene. The former is likely to be more significant because the recording depth of all other sensors and proxies used in this compilation is bound to have remained within the photic zone.
Changes in seasonality affect climate reconstructions based on terrestrial vegetation, and this has led to reconstruction approaches that focus on bioclimatic variables more closely related to the physiological controls on terrestrial plant growth (Cheddadi et al., 1996) and more recently to the use of vegetation-model inversion as a reconstruction technique (e.g. Guiot et al., 2000). We suggest that both of these approaches could profitably be used to reconstruct SSTs, particularly since there are now both simple models (e.g. Geider et al., 1997) and more complex global ocean models that simulate the behaviour of plankton explicitly (e.g. Aumont et al., 2003; Le Quéré et al., 2005) based on the growing understanding of the ecology of individual plankton groups. Improved understanding of the ecology of different plankton groups, and how this could lead to changes in the seasonality, depth habitat, and adaptation to changing environmental conditions, could also provide insights into the causes of differences between the reconstructions obtained from different sensors (Leduc et al., 2010), thus allowing the reconstruction of more ecologically sensitive variables from existing data sets.
Although we applied several quality criteria for the selection of suitable records, including a minimum requirement on the chronological control, differences in the SST pattern may also be related to chronological offsets between some cores. However, given the results of the simple exercise in which we used two different definitions of the early Holocene time window, it is questionable how different the SST signal would actually have been even with a chronological error of 1000 years. If, as we believe, the early Holocene SST signal was weak, then chronology alone is unlikely to explain the observed differences, unless we have made an error so large as to compare Holocene and Glacial sediments.
Our analyses were greatly facilitated by the fact that much of the primary data and the SST reconstructions are archived at, for example, Pangaea (http://www.pangaea.de) or NOAA's National Climatic Data Center (http://www.ncdc.noaa.gov/data-access/paleoclimatology-data). However, target data sets for model evaluation need to be comprehensive because regional and/or zonal signals could be significantly affected by data gaps. Following Kucera et al. (2005a), we strongly urge the community to ensure that marine data and reconstructions are promptly archived in order that the modelling community can make full use of these resources.

Conclusions
There are multiple sources of uncertainty associated with SST reconstructions. The MH change in SST is small compared to the magnitude of these uncertainties. Thus, unlike the LGM, where robust changes in SST patterns emerge despite the methodological uncertainties (MARGO Project Members, 2009), a MH SST synthesis derived by the same standards as the MARGO compilation does not yet provide a reliable benchmark for model simulations. New approaches to SST reconstructions, including the use of inverse modelling, are required to improve this situation (e.g. Kageyama et al., 2013). The observed mismatches among the estimates of the different sensors indicate that something fundamental about the sensors' ecology is not yet understood; this understanding will be essential to represent the sensors correctly in the models.
The Supplement related to this article is available online at doi:10.5194/cp-10-2237-2014-supplement.

Figure 1. Impact of using different modern reference climates on gridded (5 × 5°) mid-Holocene (MH) mean annual sea surface temperature (SST) anomalies: (a) difference between MH anomalies calculated relative to the 1998 version of the World Ocean Atlas data set (WOA98) and the 2009 version of this data set (WOA09), (b) differences in MH anomalies calculated using WOA98 and the Hadley Centre Sea Ice and Sea Surface Temperature (HADiSST) data set, and (c) differences in MH anomalies calculated using WOA98 and the Global database for alkenone-derived HOlocene Sea-surface Temperature (GHOST) data set. The histograms show the frequency distribution of MH anomalies in 0.5 °C temperature classes reconstructed using each of the reference climates: (d) WOA98, (e) WOA09, (f) HADiSST, and (g) GHOST.

Figure 2. Between-sample variability in reconstructed sea surface temperatures (SSTs). (a) Reconstructed annual SST anomalies at individual sites with sample resolution of < 100 years in the 1000-year window from 6.5 to 5.5 ka BP used for mid-Holocene (MH) reconstructions. The grey bar shows the smaller 500-year window from 6.25 to 5.75 ka BP. (b) Standard deviation of mean annual SST anomalies within the 6 ± 0.5 ka BP and 6 ± 0.25 ka BP time windows at individual sites. (c) Comparison of observed standard deviation of SST and number of samples used to calculate the mean values within the 1000-year and 500-year windows.

Figure 3. Comparison of reconstructed annual, summer, and winter sea surface temperature (SST) anomalies at individual sites where reconstructions were made on at least two different sensors. The sites are arranged by latitude for convenience.

Figure 4. Comparison of reconstructed annual, summer, and winter sea surface temperature (SST) anomalies for different ocean basins using different sensors. The box-and-whisker plots show anomalies based on (a) using all samples that fall within the 6.5 to 5.5 ka BP time window for all of the individual records in a basin and (b) using the average SST anomaly for the 6.5 to 5.5 ka BP time window from each record. Only sensors that are represented by a minimum of three data points in any basin are plotted. The box-and-whisker plots were drawn using the Golden Software Grapher, which applies Tukey's method showing the 5 to 95 percentiles. The line shows the median. The whiskers are drawn down to the 5th percentile and up to the 95th. Points below and above the whiskers are drawn as individual dots. Outliers are calculated as the 75th (25th) percentile plus (minus) 1.5 × IQR (interquartile range). If this value is greater than (smaller than or equal to) the largest value in the data set, the upper whisker is drawn to the largest value. Any points greater (smaller) than the 75th (25th) percentile plus (minus) 1.5 × IQR are plotted as individual points. The chance of finding an "outlier" by Tukey's rule in data sampled from a Gaussian distribution depends on sample size.

Figure 5. Gridded reconstructions of mid-Holocene (b) mean annual, (c) summer, and (d) winter sea surface temperature (SST) anomalies. The gridded values are averages of all records within the 5 × 5° latitude/longitude grid. The map in (a) shows the distribution of reconstructions based on individual sensors.

Figure 6. Assessment of the signal-to-noise ratio in reconstructed sea surface temperatures (SSTs) at (a) individual sites where there are more than three samples within the 6.5-5.5 ka time window, (b) individual grid cells, and (c) individual grid cells where there are more than two records in the grid. Each plot shows the average change in SST compared to the standard error (°C). The bars attached to each reconstruction represent the seasonally appropriate average measurement or calibration uncertainties on the sensor (foraminifera: ±1.35 °C winter, ±1.4 °C summer, ±1.3 °C mean annual; dinocyst: ±1.2 °C winter, ±1.6 °C summer, ±1.1 °C mean annual; alkenones: ±1.2 °C mean annual; Mg / Ca: ±1.2 °C summer). Each dotted line is defined by the anomaly ± the standard error, i.e. points that fall outside these lines (taking into account the measurement or calibration uncertainty) would be considered to show a significant anomaly at the 95 % confidence level.