A systematic comparison of bias correction methods for paleoclimate simulations

Even the most sophisticated global climate models are known to have significant biases in the way they reconstruct the climate system. Correcting model biases is therefore an essential step toward realistic paleoclimatologies, which are crucial for modelling long-term and large-scale ecological dynamics. Here, we evaluate three widely-used bias correction methods – the delta method, generalised additive models and quantile mapping – against a global dataset of empirical temperature and precipitation records from the present, the 5 mid-holocene (~6,000 years BP), the last glacial maximum (~21,000 years BP) and the last interglacial period (~125,000 years BP). Overall, the delta method performs best at minimising the median absolute error between empirical data and debiased simulations for both temperature and precipitation, although there is considerable spatial and temporal variation in the performance of each of the three methods. We indicate that additional empirical reconstructions of past climatic conditions might make it possible to soon use past data not only for the 10 validation but for the active calibration of bias correction functions.


Introduction
Realistic reconstructions of global paleoclimate are a key requirement for modelling many important long-term and large-scale ecological processes (Eriksson et al., 2012;Timmermann and Friedrich, 2016;Leonardi et al., 2018;Zhu et al., 2018;Rangel et al., 2018).Despite advancements in how complex physical processes are represented in global climate models, simulated present-day climate remains subject to substantial biases when compared to observational data (Solomon et al., 2007;Ehret et al., 2012).Depending on the region of interest, these biases can be of the order of a few degrees of temperature or centimeters of annual precipitation, which Whilst bias correction has received a great deal of attention for present-day and near-future simulations (Ho et al., 2011;Maraun and Widmann, 2018), work on paleoclimate reconstructions has been much more limited.This is partly due to the different time scale of paleoecological applications, for which computationally intensive bias correction methods that are used for the recent past and near future are not suitable.There are three main methods that have been used so far: the delta method (http://www.worldclim.org/downscaling),statistical methods based on generalised additive models (GAMs) (Vrac et al., 2007;Levavasseur et al., 2011;Woillez et al., 2014;Latombe et al., 2018) and quantile mapping (Lorenz et al., 2016).All three methods are based on the assumption that the biases between present-day observations and simulated data do not change through time, even though each method takes a different approach in the aspect that is assumed to be invariant.The delta method assumes bias to be location-specific (Maraun and Widmann, 2018), as it is based on a map of differences between observed and simulated values.GAMs attempt to represent local processes by finding statistical association between proxies for those processes (e.g.altitude, distance from the coast) and biases for presentday observation (Vrac et al., 2007;Maraun and Widmann, 2018).Finally, quantile mapping assumes the shape of the distribution of a certain variable to be constant through time (Maraun and Widmann, 2018).However, debiased simulation data have either not been validated against observational data at all, or only for a small geographical area and a single point in the past.Furthermore, because of the limited spatial resolution of many paleoclimate reconstructions, bias correction is conflated with downscaling, thus confounding any estimate of the actual effect of debiasing the data.
Here, we use a set of high-resolution climate simulations to systematically evaluate the performance of the delta method, a GAM-based approach, and quantile mapping, against a global dataset of empirical climatology data from the present, the mid-holocene (~6,000 years BP), the last glacial maximum (~21,000 years BP) and the last interglacial (~125,000 years BP).Thanks to the high resolution of our simulation data, we can isolate the effect of debiasing from downscaling.
Section 3 provides details of the three bias correction methods, the climate simulations, and the empirical paleoclimatology reconstructions used in this study.In section 4, we quantitatively assess the performance of the methods at a global scale, and with regard to spatial heterogeneities.Section 5 discusses how paleoclimate We used paleoclimate simulations of monthly temperature and precipitation at a 1.25 • ×0.83 • grid resolution for the present, the mid-Holocene and the last glacial maximum (LGM) from the HadAM3H atmospheric model, which is part of the family of HadCM3 climate models (Valdes et al., 2017).For the last interglacial, we do not have simulation data from HadAM3H, but we used the global climate model emulator GCMET (Krapp et al., 2019) that is based on the same model and can make predictions at the same spatial resolution.
Empirical data (see below) of continental temperature were compared against simulated temperature at 1.5 meters height, whereas simulated air surface temperature was used as a proxy for sea surface temperature, since sea surface temperature is not part of the HadAM3H output.We removed marine data points for which simulated air surface temperature was below the freezing point of saltwater, -1.8 • C, as in this case the simulated value corresponds to the temperature of an ice layer rather than that of the top layer of water.

Empirical climate data
All bias correction methods considered in this paper are calibrated on present-day observational data.We used monthly continental temperature and precipitation data at a 0.167 • grid resolution (New et al., 2002), and mean annual sea surface temperature at a 1 • grid resolution (Reynolds et al., 2002(Reynolds et al., ), representative of 1960(Reynolds et al., -1990. .These maps were aggregated to the 1.25 • ×0.83 • grid of the paleoclimate simulations by taking the average of values contained in each target grid cell. We used paleoclimate reconstructions of continental mean annual temperature, temperature of the coldest and warmest month, and annual precipitation for the mid-Holocene and the LGM from Bartlein et al. (2011), reconstructions of mean annual sea surface temperature for the mid-Holocene and the LGM from Hessler et al. (2014) and Waelbroeck et al. (2009), respectively, and reconstructions of mean annual continental and sea surface temperature for the last interglacial period from Turney and Jones (2010).

The delta method
The delta method assumes that the local (i.e.grid cell-specific) model bias is constant over time (Maraun and Widmann, 2018).For temperature, the local bias is given by the difference between present-day simulated and observed temperature, T obs (0) − T sim (0).Debiased temperature, T sim (t), at some time t is obtained by adding the bias term to the simulated temperature, T sim (t), at time t: (1) The second expression illustrates that T sim (t) is alternatively given by adding the simulated climate change signal to the present-day observed temperature.
Precipitation is bounded below by zero and covers different orders of magnitude across different regions.A multiplicative rather than additive bias correction is therefore more adequate when applying the delta method for precipitation (Maraun and Widmann, 2018).Analogously to temperature, debiased precipitation is given by This is equivalent to log-transforming simulated and observed precipitation values, applying the additive delta method, and back-transforming the result.

Statistical Models / GAMs
Statistical bias correction methods assume the existence of a functional relationship between climate model outputs as well as potentially additional forcings such as topography, and real climate (Vrac et al., 2007;Maraun and Widmann, 2018).Transfer functions representing this relationship are calibrated on the basis of present-day simulated and observed climate, and are then used to derive past climate based on the appropriate simulations.Generalised additive models (GAMs) have gained particular popularity as transfer functions (Vrac et al., 2007;Levavasseur et al., 2011;Woillez et al., 2014;Latombe et al., 2018).They accommodate potential nonlinearities in the response of the individual variables, but assume that the interactions between predictor variables can be neglected (owing to the computational requirements of general high-dimensional nonlinear regressions).GAMs compute the expected value of debiased local temperature (and, analogous, precipitation) as  2018), here we used elevation, the shortest distance to the ocean and simulated temperature as predictors for debiased temperature, and the elevation, the shortest distance to the ocean and simulated precipitation, temperature, wind speed, air pressure and relative humidity as predictors for debiased precipitation.The functions f i were estimated as piecewise third order polynomials (using thin plate splines did not change the results) using the mgcv package (Wood, 2004) in R.

Quantile mapping
This method involves mapping quantiles of the simulated distribution of a climate variable onto the appropriate observed present-day quantiles, so as to remove systematic distributional biases in the simulation data (Maraun and Widmann, 2018).Debiased temperature (and, analogous, precipitation) is calculated as where

Method evaluation
We assessed the local error between empirical and debiased simulated data at a given time t and location as E = T sim (t) − T obs (t) and E = P sim (t) − P obs (t) P obs (t) (5) for temperature and precipitation, respectively.Our primary measure for evaluating and comparing the performance of the three bias correction methods considered is the median of the absolute differences between empirical and debiased simulated data across all points for which empirical records are available.The median is weighted by grid cell area for the present, and by the available inverse standard errors for the past.For a given variable and point in time of interest, denote by E 1 , . . ., E n the local errors (Eq.( 5)) from all locations for which empirical data is available.Then the median absolute error is given by and where the weights w i are given by the inverse standard error associated with the appropriate empirical reconstructions, rescaled such that n i=1 w i = 1.We consider a bias correction method to overall improve the raw simulation outputs if the associated median absolute error is smaller than the median absolute difference between raw simulations and empirical data, which is calculated analogously.
Debiased simulated data should ideally not contain any systematic bias in that the absolute median error, should not differ significantly from zero.In addition to considering the median absolute error, we also examine how different methods affect the associated median error.

Results
Overall, the delta method provided the strongest reduction in median absolute bias (MAE, Eq. ( 6)) for all variables and past time points (Fig. 1), with the expection of precipitation at the LGM, where quantile mapping performs better.The relatively good performance of the delta method is also reflected in the correlations between present-day and past model bias (Fig. A6).Quantile mapping significantly increased original biases in continental mean annual temperature throughout the time periods considered, as well as in present-day and mid-Holocene precipitation.GAMs generally led to a reduction in bias, even though not as effectively as the delta method; however, for LGM continental mean annual temperature, the bias was actually increased.On average, raw simulations underestimated mean annual temperature and overestimated annual precipitation across time periods (Fig. A7).These trends were not completely removed by any bias correction method, although both the delta method and GAMs consistently reduced the original absolute median error.GAMs, in particular, minimised the absolute median errors for past mean annual temperature and mid-Holocene and LGM precipitation (Fig. A7).The performance of the different methods is not uniform across space nor time.Fig. 2 illustrates this heterogeneity for the delta method.For example, the delta method significantly reduced the original bias of modelled precipitation in Eastern North America in the mid-Holocene, but hardly improved the raw simulations in the Sahara, whereas the opposite pattern can be observed at the LGM.The performances of the methods relative to each other also varied significantly across both space and time.
While the delta method has a slight overall edge over the GAM approach, the comparison of the two methods in Fig. 3 shows that even within small geographical regions neither method performs consistently better than the other.Moreover, a better performance of one method in a specific location at a certain point in time generally does not guarantee the same result at a different time.For example, the delta method overall reduced the original error of modelled precipitation more than the GAM approach in Eastern North America during the Mid-Holocene, but less during the LGM.

Discussion
Whilst the delta method performs slightly better at debiasing temperature and precipitation compared to GAMs for the empirical data considered here, we note that this method is only appropriate for a given land conformation.Thus, it is only appropriate for the Late Quaternary, and even for this period, changes in sea levels are problematic as they expose areas for which we have no bias information as well as changing the areas affected by maritime climate.GAMs should, in theory, obviate these problems by quantifying local processes as statistical relationships with appropriate proxies.Whilst this approach might be the only option for the deeper past, our results point to the fact that reconstructing such local processes in such a way is challenging, as demonstrated by its inferior performance to the delta method.A possible limitation of GAMs as currently applied to bias correction and downscaling is that they assume additivity, thus estimating the effect of given proxies for the prevaling climate state observed at present day.By fitting interactions, it would be possible to allow for these effects to differ depending on the local climatic conditions, but the computational complexitiy of interactions with such large datasets is non-trivial.
A major limitation of current approaches to debias climate model data is that they all assume biases in presentday climate to be fully representative of the past.With the progressive increase in the number of empirical reconstructions of past climatic conditions, it might be possible to soon move from a situation where past data are use to verify correction schemes (as we did in this manuscript) to using those data to actively calibrate the bias correction function.time.However, uncertainties are large, patterns do not seem fully consistent across time, and available data points do not represent the world uniformly.A robust statistical model will require not only additional data from currently underrepresented geographical areas (specifically the southern hemisphere), but also curating empirical reconstructions, as successfully done for the last millenium (Hakim et al., 2016;Tardif et al., 2018).

Conclusions
Our comparison of global debiased paleosimulation data and empirical reconstructions suggests that, overall, the delta method provides slightly better performance at debiasing compared to GAMs, whilst quantile mapping is a poor choice for removing bias.Given our results, we suggest that the delta method is good starting point for bias removal of simulated Late Quaternary climate data, bearing in mind that its effectiveness varies regionally.

5
Whilst the datasets used in this paper are a step in the right direction, they are still too sparse and diverse for the purpose of actively parameterising debiasing functions.We believe that such a resource would allow bias removal methods to greatly improve in their effectivness.

1
Clim.Past Discuss., https://doi.org/10.5194/cp-2019-11Manuscript under review for journal Clim.Past Discussion started: 18 February 2019 c Author(s) 2019.CC BY 4.0 License.can make the difference between markedly different vegetation types (e.g. the shift from open to closed habitat, or the location of deserts).

2
Clim.Past Discuss., https://doi.org/10.5194/cp-2019-11Manuscript under review for journal Clim.Past Discussion started: 18 February 2019 c Author(s) 2019.CC BY 4.0 License.reconstructions could be used not only to evaluate methods, but to help estimate the variation of local model bias over time, thus combining the strengths of the delta method and statistical bias correction.
obs and F sim [t] denote the cumulative probability functions of the global set of observed values at present day and of simulated values at time t, respectively.Clim.Past Discuss., https://doi.org/10.5194/cp-2019-11Manuscript under review for journal Clim.Past Discussion started: 18 February 2019 c Author(s) 2019.CC BY 4.0 License.

Figure 2 .
Figure 2. Reduction of the original model bias by the delta method for continental and marine mean annual temperature and continental annual precipitation.The lower end of the colour scale was capped at -100% (i.e. a doubling of the original error).

Figure 3 .
Figure3.Relative performances of the delta method and a GAM approach in terms of debiasing simulated mean annual temperature (left column) and annual precipitation (right column).The colour spectrum represents the interval [0,1], and marker colours are calculated as the ratio of the absolute local error (Eq.(5)) of the GAM-based approach divided by the sum of the absolute local errors of both methods.

Figure 4 .
Figure 4. Differences between local past and present model bias (at locations for which reconstructions are available) against the local simulated climate change signal (i.e. the difference between past and present simulated value) of the variable of interest.Red, blue and green markers represent data from the mid-Holocene, the LGM and the Last Interglacial, respectively.Error bars represent standard errors of the empirical reconstructions.Lines show robust linear regressions.

Figure A1 .
Figure A1.Empirical reconstructions of continental mean annual temperature against raw and debiased simulation data.Lines show 1:1 relationships.

Figure A3 .
Figure A3.Empirical reconstructions of continental temperature of the warmest month against raw and debiased simulation data.Lines show 1:1 relationships.

Figure A4 .
Figure A4.Empirical reconstructions of continental temperature of the coldest month against raw and debiased simulation data.Lines show 1:1 relationships.

Figure A5 .
Figure A5.Empirical reconstructions of continental annual precipitation against raw and debiased simulation data.Lines show 1:1 relationships.

Figure A6 .
Figure A6.Comparison of present-day and past model biases in locations where reconstructions are available.Lines show 1:1 relationships.
where x 1 , . . ., x n are predictors provided by the climate model, such as simulated temperature and precipitation, or represent other known geographical or physical variables such as local elevation or the shortest distance to the ocean.f 1 , . . ., f n are generally nonlinear functions that are fitted using present-day observed and simulated variables.Debiased past variables are calculated by applying the fitted functions to past simulated variables.