Inconsistencies between observed, reconstructed, and simulated precipitation indices for England since the year 1650 CE

The scarcity of long instrumental records, uncertainty in reconstructions, and insufficient skill in model simulations hamper assessing how regional precipitation changed over past centuries. Here, we use standardised precipitation data to compare a regional climate simulation, reconstructions, and long observational records of seasonal (March to July) mean precipitation in England and Wales over the past 350 years. The Standardized Precipitation Index is a valuable tool for assessing agreement between the different sources of information, as it allows for a comparison of the temporal evolution of percentiles of the precipitation distributions. These evolutions are not consistent among reconstructions, a regional simulation, and instrumental observations for severe and extreme dry and wet conditions. The lack of consistency between the different data sets may be due to the dominance of internal climate variability over the impact of natural exogenous forcing conditions on multidecadal time-scales. The disagreement between sources of information reduces our confidence in inferences about the origins of hydroclimate variability for small regions. However, it is encouraging that there is still some agreement between a regional simulation and observations. Our results emphasize the complexity of hydroclimate changes during recent centuries and stress the necessity of a thorough understanding of the processes affecting forced and unforced precipitation variability.

The Climate Explorer (http://climexp.knmi.nl/) provides access to a number of long series of monthly instrumental precipitation observations from the Global Historical Climatology Network (Peterson and Vose, 1997). We use those from Oxford, Kew Gardens, and Pode Hole in addition to the observationally derived Met Office Hadley Centre data sets. The Climate Explorer provides monthly data for these locations from 1697 to 1999, 1726 to 1994, and 1767 to 1999. The later years in the Oxford record include missing values and we therefore only use data from 1767 to 1996 CE. Frank et al. (2007) noted the uncertainties in early instrumental temperature observations. Additionally, the very early data in the Central England Temperature data includes non-instrumental indirect data to infer past temperature. Similarly, early precipitation observations require rigorous quality control (e.g., Burt and Howden, 2011). Woodley (1996) reviews the history of precipitation data for England and Wales as well as Scotland.

There are a number of gridded reconstructions of hydroclimatic parameters covering the European domain. Continental-domain gridded precipitation reconstructions include Pauling et al. (2006), Casty et al. (2007), and Franke et al. (2017). Reconstructions of drought indices like the PDSI exist as gridded products, too, for various regions of the world including Europe (The Old World Drought Atlas, Cook et al., 2015). These products allow for assessments of the quality of the hydroclimate in paleoclimate simulations (Smerdon et al., 2015). We decide to use regional precipitation reconstructions for our domain instead of gridded products to minimise the effect of the reconstruction method on the results. We focus on precipitation as it allows the direct comparison with long instrumental records and it is a parameter directly experienced by people.
To our knowledge, there are three precipitation reconstructions for small domains from southern Great Britain, i.e., approximately within the domain of the England-Wales precipitation and the Central England temperature. These are for East Anglia (Cooper et al., 2013), for Southern-Central England (Wilson et al., 2013), and the reconstruction for Southern England by Rinne et al. (2013). The former two use tree-ring width data for their reconstructions, the latter uses tree-ring oxygen isotopes. There is additionally the work by Young et al. (2015), who scale a δ18O composite record from Great Britain to the England-Wales precipitation.
In the main manuscript, we only use the data by Cooper et al. (2013) and Wilson et al. (2013) for, respectively, East Anglia and Southern-Central England in March, April, May, June, July (MAMJJ). Cooper et al. (2013) and Wilson et al. (2013) identified this extended spring as the season their tree-ring width records are sensitive to for their reconstructions of precipitation.
These authors calibrate their tree-ring data against gridded precipitation beyond their target regions of Southern-Central England and East Anglia, respectively. Thereby the reconstructions are possibly biased beyond their respective regions of interest.
They compare their reconstructions against the long instrumental records and find a lack of stability of the relation to the instrumental data. They discuss the limitations of their reconstructions representing less than 40% of the regional precipitation variance over the 20th century. Obviously, the reconstructions suffer from the limited lengths of the available tree-ring samples.
This may limit the resolution of precipitation variability at low frequencies in the reconstructions.
Although the reconstructions show a notable amount of low-frequency variability, Cooper et al. (2013) caution against too much confidence in the reconstructed low-frequency precipitation variability. They explicitly call their work "preliminary" with respect to reconstructing low-frequency precipitation variability. Wilson et al. (2013) and Cooper et al. (2013) emphasize the weaknesses of their reconstructions in representing extreme years. On the other hand, both are confident in the mid- to high-frequencies of their reconstructions. The authors note variable relationships between tree growth and environmental controls for their regions in the past. Indeed, there are periods when the relationships between trees and precipitation are not significant. Wilson et al. (2013) and Cooper et al. (2013) discuss the possibility that the tree species used for their reconstructions were less sensitive to precipitation over certain periods, e.g., the early 19th century. That is, the proxies, theoretically representing a precipitation signal, also contain a temperature signal, for instance, if they are sensitive to soil moisture. Wilson et al. further suggest an effect of the Industrial
We think the focus on the tree-ring width based reconstructions is appropriate to present the possibilities of using the SPI and to highlight potential consistencies and inconsistencies between the different data sources. In the following, we compare the two reconstructions for southern Great Britain with the England-Wales precipitation observations.

Simulations
We compare the observations and the reconstructions to output from a regional simulation with the model CCLM for the European domain over the period 1645 to 1999 as also used by Gómez-Navarro et al. (2014) and Bierstedt et al. (2016). We use output from 1652 onwards (Gómez-Navarro et al., 2014). To our knowledge, this simulation is one of only two transient regional simulations for this region and the last few centuries.
Forcing for the regional simulation is from a global simulation with the Max-Planck-Institute Earth System Model (MPI-ESM) in its Millennium-simulation COSMOS-setup. For details, see Jungclaus et al. (2010). This version of MPI-ESM couples the atmosphere model ECHAM5, the ocean model MPI-OM, a land-surface module including vegetation (JSBACH), a module for ocean biogeochemistry (HAMOCC), and an interactive carbon cycle. For the simulation, ECHAM5 was run with a T31 horizontal resolution and with 19 vertical levels. MPI-OM used a variable resolution between 22 and 250 km on a conformal grid for this simulation. The ensemble used diverse forcings. The driving simulation for the regional simulation with CCLM is one MPI-ESM simulation with all external forcings and a reconstruction of the solar activity based on Bard et al. (2000), i.e. with a comparatively large amplitude of solar variability. The regional climate model CCLM simulation (Wagner, personal communication; see also Gómez-Navarro et al., 2014; Bierstedt et al., 2016) uses adjusted forcing fields relevant for paleoclimate simulations as also used with the global MPI-ESM simulation. These include orbital forcing and solar and volcanic activity. Since the regional model does not represent the stratosphere, the regional simulation considers the effect of volcanic aerosols as a reduction in solar constant equivalent to the net solar shortwave radiation at the top of the troposphere in MPI-ESM. CO2 variability is prescribed and changes in greenhouse gases CO2, CH4, and N2O are based on data by Flückiger et al. (2002). Land-cover changes are included as external lower boundary forcing using the same data set as the MPI-ESM simulation (Pongratz et al., 2008). The presented CCLM simulation uses a rotated grid with a horizontal resolution of 0.44 by 0.44 degrees and 32 vertical levels. The sponge zone of seven grid points at each domain border is removed and fields are interpolated onto a regular horizontal grid of 0.5 by 0.5 degrees.
We choose the domain including grid points closest to the longitudinal and latitudinal borders 5.5W to 1.5E and 50.5 to 54.5N to represent the England and Wales precipitation domain. This selection is somewhat arbitrary but we assume it sufficiently represents the England-Wales precipitation domain to allow meaningful comparison of changes in percentiles, although not in absolute percentile values. We choose the domain 5 to 0W and 50 to 55N as simulated counterpart of the Central England Temperature. The simulated East Anglia series represents the domain 0E to 2E and 52N to 53N, and we choose the domain 2.5W to 0E and 51N to 52.5N as equivalent for Southern-Central England. All analyses are for the extended spring season, MAMJJ, since this is the seasonal focus of the reconstructions. The appendix provides a short evaluation of the simulation against the observational CRU-data (Harris et al., 2014) over the European domain. We do not apply any bias correction to the simulation output.
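The regional series described above are averages over grid points inside a longitude-latitude box. The following is a minimal sketch of such a selection, not the authors' processing chain: the dictionary layout, the function name `box_mean`, the inclusive treatment of the box borders, and the unweighted mean are all assumptions made for illustration (an area weighting by latitude would be an equally plausible choice). Only the box bounds are taken from the text.

```python
from typing import Dict, Tuple

# Hypothetical container: (lon, lat) of a grid-point centre -> seasonal precipitation in mm.
Grid = Dict[Tuple[float, float], float]


def box_mean(grid: Grid, lon_min: float, lon_max: float,
             lat_min: float, lat_max: float) -> float:
    """Unweighted mean over all grid points inside the box (borders inclusive)."""
    values = [v for (lon, lat), v in grid.items()
              if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max]
    if not values:
        raise ValueError("no grid points inside the requested box")
    return sum(values) / len(values)


# Box bounds as given in the text (west longitudes written as negative numbers).
ENGLAND_WALES = (-5.5, 1.5, 50.5, 54.5)
EAST_ANGLIA = (0.0, 2.0, 52.0, 53.0)
SOUTHERN_CENTRAL_ENGLAND = (-2.5, 0.0, 51.0, 52.5)

# Toy field on a regular 0.5-degree grid; the values are made up and simply
# increase with latitude so that the result is easy to check by hand.
toy = {(lon / 2.0, lat / 2.0): 300.0 + lat / 2.0
       for lon in range(-12, 5) for lat in range(100, 111)}

east_anglia_mean = box_mean(toy, *EAST_ANGLIA)
```

Applying `box_mean` to each seasonal field in turn would give one regional MAMJJ value per year for each of the three domains.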
So far, global simulations for the last millennium have notably coarser resolutions than the 0.44 by 0.44 degree of the regional simulation we use here (compare, e.g., Fernández-Donado et al., 2013; PAGES 2k-PMIP3 Group, 2015). However, in contrast to present-day and future scenario regional simulations, a 0.44 by 0.44 degree resolution represents a comparatively coarse resolution dynamical downscaling. As a review by Ludwig et al. (2018, including two of the present authors) highlights, this is because the demand for long simulation periods limits applications of regional models in paleoclimatology to relatively coarse setups. Thus, one may question the benefits of the approach compared to more recent higher-resolution global simulations, e.g., with the global models CCSM4 and CESM1 (Landrum et al., 2012; Lehner et al., 2015), which have resolutions of 0.9° × 1.25°. Sørland et al. (2018) discuss the benefits of regional climate simulations in studies on regional climates. Besides other models, they also use CCLM in a 50 km setup comparable to the simulation used here. They note that improved representation of regional climate in a regional simulation is not solely due to the increased resolution but may be due to different strategies in model-building and tuning. Pinto et al. (2018) explain differences in results from regional, including CCLM, and global simulations for southern Africa by an interplay between the representation of sub-grid-scale processes in the different models and factors related to the increased resolution. Blenkinsop and Fowler (2007) find that regional climate models may be deficient in their ability to model persistent low precipitation episodes for the British Isles, which has repercussions for their representation of drought events. The review by Ludwig et al. (2018) reports more realistic distributions for precipitation in regional paleoclimate simulations. Flato et al. (2013, chapter 9 of the IPCC AR5) review the progress of regional downscaling and high-resolution modelling.
They emphasize that the skill of such exercises depends on the model used, the season, the domain of interest, and the considered meteorological variable. They highlight studies showing that there is not a linear increase in simulation skill towards higher resolutions. Higher resolutions typically provide more reliable estimates of extremes, including heavy rainfall.
The quality of the simulated precipitation still strongly depends on the parameterisations implemented in the regional climate model. Precipitation events, especially convective ones, are still sub-grid processes, even within regional climate models. Concentrating on accumulated amounts on seasonal time-scales and their long-term changes, however, allows a more robust comparison of simulated precipitation to observed and reconstructed data.

Methods
One objective of this manuscript is to highlight how the concept of the Standardised Precipitation Index (SPI, McKee et al., 1993) adds perspectives when comparing various sources of information for periods with and without instrumental observations. Therefore, we briefly introduce the SPI-transformation procedure and how we use this information to subsequently compare precipitation estimates from observations, reconstructions, and a regional climate simulation.

The Standardized Precipitation Index (SPI)
Standardising precipitation data facilitates comparing precipitation distributions between different locations, time-scales, periods, and data sources. For this purpose, McKee et al. (1993) introduced the Standardized Precipitation Index (SPI).
The Interregional Workshop on Indices and Early Warning Systems for Drought proposed the SPI as a common index to facilitate comparability between meteorological drought estimates for different regions (Hayes et al., 2011; see also Keyantash et al., 2002). The SPI should complement previously used indices. Raible et al. (2017) find the SPI to be a reliable drought index for Western Europe including the British Isles. The standardisation inherent to the SPI allows further applications, e.g., flood monitoring (Seiler et al., 2002), and the easy comparison of normal, dry, and wet conditions between different sources of data. Indeed, the UK drought portal (https://eip.ceh.ac.uk/droughts) relies on the SPI. Sienz et al. (2012) discuss associated biases of the approach.

The SPI uses only precipitation, which makes it an ideal and relatively straightforward tool for comparing hydroclimatic data between different data sources. Precipitation is a standard output of simulations, long instrumental records exist for various locations, and a number of reconstructions exist as well.
However, as the SPI uses only precipitation, it is of less value when the interest is in, e.g., the water supply, runoff, or streamflow (but see Seiler et al., 2002). The focus on precipitation also limits the applicability for studying temperature-sensitive parts of the hydrological cycle and impacts on biological and anthropogenic systems (e.g., PAGES Hydro2k Consortium, 2017; Keyantash et al., 2002; Hayes et al., 2011; Van Loon, 2015).
Previous usage of the SPI in paleoclimatology focussed on the index series. For example, Domínguez-Castro et al. (2008) and Machado et al. (2011) compare SPI-series to differently derived hydroclimatic indices over approximately the last 500 years. Other studies reconstructed the SPI instead of absolute precipitation amounts (e.g. Seftigen et al., 2013; Yadav et al., 2015; Tejedor et al., 2016; Klippel et al., 2018). Lehner et al. (2012) use the SPI to compute pseudo-proxies from re-analysis data and long simulations with global climate models to test a reconstruction-method.

Transformation
The SPI requires fitting a distribution function to the precipitation data and there are various candidate distributions (e.g., Sienz et al., 2012; Stagge et al., 2015, and their references). In our analyses, we fit a Weibull distribution. Sienz et al. (2012) highlighted that the Weibull distribution performed better in transforming the England-Wales precipitation data on a monthly time-scale compared to a number of other distributions. However, other distributions outperformed the Weibull distribution for other data sets and other SPI time-scales. Results differ only little if we fit Gamma or Generalised Gamma distributions (not shown). Our procedure of the SPI-calculation follows the detailed description by Sienz et al. (2012). McKee et al. (1993) recommend at least 30 data points for successful distribution fits, but Guttman (1994) notes the lack of stability for small sample sizes. We fit distributions over sliding 51-year windows. Thus, we use more data points than recommended by McKee et al. (1993) but still fewer than the 60 points for which Guttman (1994) finds convergence of higher order L-moments. Appendix Figure B1 shows 95% intervals of a bootstrap procedure sampling 1000 times 40 data points from each window and fitting distributions to these samples. The choice of 40 data points is an ad hoc decision. We could also have chosen sample sizes of 49 data points.
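The transformation described above can be illustrated with a short sketch: fit a two-parameter Weibull distribution to each 51-year window and map a precipitation amount to a standard-normal quantile through the fitted cumulative distribution. This is a sketch under stated assumptions rather than the procedure of Sienz et al. (2012): the maximum-likelihood fit by bisection, the clipping constant `eps`, the function names, and the synthetic input series are all choices made here for illustration.

```python
import math
import random
from statistics import NormalDist


def fit_weibull(sample):
    """Maximum-likelihood fit of a two-parameter Weibull; returns (shape, scale)."""
    if len(set(sample)) < 2 or min(sample) <= 0:
        raise ValueError("need at least two distinct, strictly positive values")
    top = max(sample)
    y = [x / top for x in sample]            # rescale so that y**k cannot overflow
    log_y = [math.log(v) for v in y]
    mean_log = sum(log_y) / len(y)

    def score(k):
        # Likelihood equation for the shape parameter; it increases with k.
        powers = [v ** k for v in y]
        weighted = sum(p * l for p, l in zip(powers, log_y))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while score(hi) < 0.0:                   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):                     # plain bisection
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(v ** shape for v in y) / len(y)) ** (1.0 / shape)
    return shape, scale


def weibull_cdf(x, shape, scale):
    """Cumulative probability of amount x under the fitted distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def spi(x, shape, scale, eps=1e-6):
    """Map a precipitation amount to a standard-normal quantile (SPI value)."""
    p = min(max(weibull_cdf(x, shape, scale), eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


def sliding_fits(series, width=51):
    """One fit per full window; returns tuples (centre_index, shape, scale)."""
    half = width // 2
    return [(c, *fit_weibull(series[c - half:c + half + 1]))
            for c in range(half, len(series) - half)]


# Synthetic seasonal totals in mm (made-up parameters, for illustration only).
rng = random.Random(0)
series = [rng.weibullvariate(330.0, 4.0) for _ in range(120)]
fits = sliding_fits(series)
centre, shape, scale = fits[0]
first_spi = spi(series[centre], shape, scale)
```

Each entry of `fits` describes the climatology of one window, so the sequence of fitted parameters is what the later percentile and standard-deviation comparisons would build on.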

Evaluation
Standardising precipitation data can at least attenuate some of the problems mentioned in the introduction. Transforming precipitation to standardised values provides further means to study the agreement or the lack thereof between different data sources.
By transforming to the SPI over moving windows, we essentially compare climatologies and potentially filter shorter term internal variability. If the climatology for the observations is the target climatology, an ensemble of climate simulations should sample this distribution following the paradigm of a statistically indistinguishable ensemble (Annan and Hargreaves, 2011). Our analyses compare how well the climatologies agree in different sources of data. Gómez-Navarro et al. (2015) give some indications that this expectation may be warranted. In the worst case, our analyses can point out that one of the data sources contradicts the others.
For any given window, the fitted distribution parameters allow for calculation of various properties. For example, we can consider the changing amount of precipitation, which one would describe as average, extremely high, or extremely low for subsequent periods. In the SPI-literature, the 6.7th and 93.3th percentiles traditionally represent the regions of severe (and extreme) dryness/wetness of the probability density function. Accordingly, we subsequently compare 6.7th and 93.3th percentiles for the fitted distributions over time. Further, we can compare the moments of the distributions. We choose to show the square-root of the Weibull distribution variance, i.e., the Weibull standard deviation over sliding windows. This provides an additional clarification of how the precipitation distribution changes over time. Appendix C shows parameters for the distribution fits. The fitted parameters allow further analyses, e.g., we can compare how likely a reference amount of precipitation is for different periods. We do this for 50th, 6.7th, and 93.3th percentiles in a reference year. We choose 1815 CE as the reference year, since it is included in all data sets and it allows potentially equivalent analyses of the PMIP3 past1000 simulations (e.g., Schmidt et al., 2011).
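As an illustration of the quantities named above, the dry and wet percentiles and the Weibull standard deviation follow in closed form from the shape and scale of a fitted two-parameter Weibull distribution. The sketch below assumes that parameterisation; the function names and the numerical parameter values are hypothetical and are not taken from the fits in this study.

```python
import math


def weibull_quantile(p, shape, scale):
    """Precipitation amount below which a fraction p of the distribution lies."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def weibull_sd(shape, scale):
    """Square root of the Weibull variance."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return scale * math.sqrt(g2 - g1 * g1)


# Hypothetical parameters for one 51-year window (illustrative values only).
shape, scale = 4.0, 330.0

dry = weibull_quantile(0.067, shape, scale)     # 6.7th percentile, in mm
median = weibull_quantile(0.5, shape, scale)    # 50th percentile, in mm
wet = weibull_quantile(0.933, shape, scale)     # 93.3th percentile, in mm
spread = weibull_sd(shape, scale)               # Weibull standard deviation, in mm
```

Evaluating these three functions for every window yields the percentile and standard-deviation series that can then be compared across data sources. The 6.7th and 93.3th percentiles correspond to standard-normal values of about -1.5 and +1.5.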

Performing the transformation to standardised precipitation over 51-year windows results in smoothed estimates. For convenience, we additionally plot smoothed time series in a number of figures. Filtered series are solely used for visualisation.
We use a Hamming window. In most cases, this has a length of 51 points but we also occasionally use different window lengths. The 51-point Hamming filter represents a different frequency cut-off than a simple 51-year moving median or moving mean as can be obtained from fitting the distributions over 51-year moving windows.
19th and mid-20th centuries (see Figure 1b). The instrumental data for Oxford appears to agree better with the data for Kew Gardens, which is to be expected from the geographic locations of the stations. Visually, both reconstructions agree less well with the observational series and with each other than the observational data agree among themselves (see Figure 1c). This holds for their variations and their overall level of variability. Figure 1d adds the Central England temperature data for MAMJJ for completeness.
Correlation matrices (Figure 2, and supplementary document) and scatterplots (see supplementary document) emphasize the differing agreement between the various data sources even more clearly. Figure 2 presents the correlation matrix for complete observations, i.e. for the period 1873 to 1994 when all records have data. Correlation coefficients change slightly if we consider pairwise complete records. Relations among precipitation data sets are always positive. They are very strong between the England-Wales data and its subdivisions, between the Kew Gardens series and the South-East England data, between the Pode Hole series and the Central England data, and between the Oxford record and the South-East England data as well as the England-Wales precipitation. The relationship between the two reconstructions is also rather strong over the sub-period.

Relations among data sets
Correlations are, however, weaker between the reconstructions and the observed series.
There is a generally negative relationship between the Central England temperature and the precipitation data sets for the chosen extended spring season from March to July. It is weakest for the Southern-Central England reconstruction but also rather relations are stronger for the observationally based data from the Met Office Hadley Centre and the instrumental series for the summer season June to August (not shown).
Correlations for non-overlapping 11-year averages are positive and strongest between the England-Wales precipitation and the two instrumental series (not shown, see supplementary document, calculated for the period 1767 to 1986). This analysis also gives reasonable correlations (r ≈ 0.51) between the pair of reconstructions and between the instrumental series. Otherwise, correlations are weak. Correlations for the extended spring season with the Central England temperature data are largest for the non-overlapping 11-year averages of the Kew Gardens instrumental series. We choose 11-year non-overlapping windows to balance the number of available data points and the filtering of interannual variability.
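The block-average correlations mentioned above can be sketched as follows. The period 1767 to 1986 spans 220 years and therefore gives exactly 20 non-overlapping 11-year blocks. The function names, the handling of a trailing partial block, and the synthetic input series are assumptions for illustration; how the original analysis treated missing values is not specified here.

```python
import math
import random


def block_means(series, width=11):
    """Means of consecutive, non-overlapping blocks; a trailing partial block is dropped."""
    n_blocks = len(series) // width
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(n_blocks)]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Two synthetic 220-year series sharing a common signal (made-up data).
rng = random.Random(42)
common = [rng.gauss(0.0, 1.0) for _ in range(220)]
series_a = [c + rng.gauss(0.0, 1.0) for c in common]
series_b = [c + rng.gauss(0.0, 1.0) for c in common]

r_interannual = pearson(series_a, series_b)
r_blocks = pearson(block_means(series_a), block_means(series_b))
```

With only 20 block means per series, such correlations carry considerable sampling uncertainty, which is the trade-off between sample size and filtering noted in the text.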
(Paleo-)observational data and regional simulation output
Figure 3 presents the two reconstructions and the England-Wales precipitation in comparison to the respective data from the regional simulation. All data are again for the extended spring season from March to July (MAMJJ), and the panels zoom in on the period of the regional simulation. We show the interannual time series and the 51-point Hamming-filtered representation.
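The 51-point Hamming-filtered representation can be sketched as a centred weighted moving average with normalised Hamming weights. This is a minimal illustration, not the plotting code behind the figures: the function names are hypothetical, an odd window length is assumed, and leaving the series ends undefined is only one of several possible edge treatments.

```python
import math


def hamming_weights(n_points=51):
    """Hamming window coefficients, normalised to sum to one."""
    raw = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_points - 1))
           for i in range(n_points)]
    total = sum(raw)
    return [w / total for w in raw]


def hamming_smooth(series, n_points=51):
    """Centred weighted moving average; positions without a full window are None.

    Assumes an odd n_points so that the window is symmetric about its centre.
    """
    weights = hamming_weights(n_points)
    half = n_points // 2
    out = [None] * len(series)
    for centre in range(half, len(series) - half):
        out[centre] = sum(w * series[centre - half + i]
                          for i, w in enumerate(weights))
    return out


# A constant series must be returned unchanged wherever the filter is defined.
smoothed = hamming_smooth([5.0] * 60)
```

Because the weights taper towards the window edges, this filter emphasises the central years more than a simple 51-year moving mean does, which is why the two have different frequency cut-offs.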
Considering the evolution of the records, the 51-point Hamming-filtered time series show pronounced differences besides some common features for the reconstructions for Southern-Central England (Wilson et al., 2013) and East Anglia (Cooper et al., 2013) (black lines in Figure 3a and b) similar to the representations in Figure 1. Both reconstructions feature a relative precipitation minimum centered on approximately the year 1800. The Southern-Central England reconstruction additionally displays a relative minimum in the early 20th century.
The observed England-Wales precipitation is available at monthly resolution from the year 1766 onward. The Hamming-filtered time series shows markedly less multi-decadal to centennial variability compared to the reconstructions, but the observations have much more interannual variability than the reconstruction for East Anglia and slightly more variability than the reconstruction for Southern-Central England (Figure 3c, black line). The filtered England-Wales time series also displays a slightly negative trend.
Differences between the simulated regional records are generally smaller (blue lines in Figure 3). Existing differences highlight the spatial heterogeneity of precipitation, e.g., interannual pairwise correlation coefficients are about 0.9 between the simulated East Anglia data and the other two records, while the simulated England-Wales precipitation correlates at approximately 0.97 with the simulated Southern-Central England data. Absolute interannual precipitation differences between the three data sets are at a maximum approximately 151 mm/season (not shown). A general feature for all regions is that excursions of the filtered simulation output often, but not always, are opposite to those of the reconstructions or observation time series.
There is an obvious bias in the absolute amounts between the simulation output and the other data sets. The simulation output series give larger precipitation amounts. We do not try to attribute this difference. We note that it is not as prominent for the more local comparison with the data from Rinne et al. (2013) for May to August and the bias is generally slightly negative for the summer season June to August for England-Wales precipitation (not shown, see supplementary document). We assume that the differing spatial representations sufficiently explain the mismatch. However, the change of sign in the bias for the summer season suggests that the simulation overestimates spring precipitation, underestimates summer precipitation, and the positive spring bias is larger than the negative summer bias. See also Appendix A for a comparison of the simulation to observational data over the full European model domain. Figure 3 shows a common feature for all three comparisons. Simulated records appear to show opposite evolutions compared to the (paleo-)observations overall but particularly in the late 18th to early 19th century and in the early to mid-20th century.

This initial comparison already shows varying levels of agreement for the chosen data sets derived from observations and the reconstructions. It highlights that the relationships between the reconstructions and the observational data sets are weaker than those between the instrumental data and the observational indices on interannual time-scales. Note that the regional observational indices include information from the instrumental data. On longer time-scales the reconstructions align less well among each other than the observationally derived time series. However, the local, purely instrumental series also show more disagreement among each other than the derived larger domain products. Filtered regional time series often evolve visually oppositely in the simulation compared to the reconstructions and the observations. So far, we used the precipitation and temperature data. In the following, we mainly use the information obtained via the transformation to standardised precipitation indices.

Comparing standardised precipitation data
Figures 4 to 6 add, respectively, the comparisons of the wet, i.e. 93.3th, percentile, the dry, i.e. 6.7th, percentile, and the square root of the Weibull distribution variance to the comparison of the interannual and filtered time series in the previous section.

Observations vs. Reconstructions
Since they represent different regions, we do not expect agreement in the absolute precipitation amounts representing wet conditions between the England-Wales precipitation data and the reconstructions in Figure 4a. We note that the difference between the wet percentile for the England-Wales precipitation and the reconstructions is larger than for the average amounts, indicating a wider distribution for the data based on instrumental observations. Precipitation histograms confirm this (not shown). On the other hand, differences are smaller for the dry percentile (Figure 5). Nevertheless, this is a sign that the reconstructions underestimate the width of the precipitation distributions of 51-year window climatologies. The opposite trends in the wet percentiles mean that the observed wet percentile represents lower precipitation amounts in the middle of the 20th century compared to the late 18th century, while the reconstructed wet percentile represents larger precipitation amounts in the middle of the 20th century compared to the late 18th century (Figure 4). Similarly, the opposite multidecadal variability in the dry percentiles of reconstructions and observations means that when the reconstructions represent a drying of the dry percentiles, the observations indicate the opposite and vice versa (Figure 5). Generally, the series for the severe to extreme dryness and wetness percentiles reflect the smoothed evolution of the respective data set before transformation into a distributional form (compare Figure 3).
We note that the data of Rinne et al. (2013) for Southern England in summer display an apparent opposite evolution of wet percentiles for the period of overlap between reconstruction and observations from the late 18th to the late 19th century. On the other hand, dry percentiles agree well over this period (not shown, see supplementary document).
Parameters for the fitted distributions also allow us to evaluate the moments of the distributions. Estimates for the Weibull standard deviations (SD in Figure 6) differ between observations and reconstructions as expected from the previously noted differences in percentiles. The reconstruction for East Anglia does not show a clear evolution in the Weibull standard deviations, whereas there is an increasing trend in the Weibull standard deviations for the Southern-Central England data. The observations show a slight reduction in the standard deviation until the middle of the 20th century, with a strong increase afterwards.

Simulation output
The simulated time series in Figure 3 show large similarities between regions. This is also the case for the wet and dry percentiles as well as for the standard deviations. Indeed, the respective statistics evolve simultaneously among the different regions, and the standard deviations overlap (Figures 4 to 6). Thus, differences between regional domains are smaller for their simulated representations compared to the observed or reconstructed records. They are slightly more notable for the moving window statistics compared to the Hamming-filtered series. Dry percentiles are very similar for East Anglia and for Southern-Central England in the simulation but wet conditions require larger precipitation amounts for Southern-Central England compared to East Anglia. Appendix B highlights that this may be due to sampling variability. Smoothed simulated data and wetness percentiles evolve similarly, but opposite evolutions of the dryness and wetness percentiles result in widening and shrinking of the distributions after approximately the year 1800.

Simulation output vs. observationally derived data and reconstructions
Simulations and reconstructions do not agree on the time evolution of precipitation percentiles (Figures 4 to 6). Any hint of an agreement between reconstructed and simulated data is likely due to randomness (compare Figure 4). There is instead a tendency towards opposite time evolutions between the data sources. This is best seen in the dry percentiles from the mid-18th to mid-20th century (Figure 5).
This apparent opposite evolution is the most common feature when comparing the percentiles derived from the simulation and from the reconstructions. When the percentile series for the reconstructions show minima, the simulation commonly shows maxima and vice versa. Obviously, using an ensemble of regional simulations would show a range of trajectories. Therefore, these results do not preclude that the model is capturing basic physical characteristics of precipitation variability in northwestern
We note that there is neither any clear commonality nor any overly opposite evolution in the dry percentiles when comparing the regional simulation to the reconstruction for Southern England summer precipitation by Rinne et al. (2013, not

Changes in probability of certain precipitation amounts
In the methods section, we describe the procedure for calculating standardized precipitation indices over moving time windows.
We obtain a distribution fit for each time window. The parameters of the fit for a window allow us to identify the probability of a precipitation amount for the respective window. Figures 7 to 9 show these changes for the precipitation amounts representing the 93.3th, 50th, and 6.7th percentiles, respectively, in a reference window. For this comparison, the reference is the distribution of precipitation in the window centered around the year 1815 CE. The year 1815 CE is included in all data sets and it allows equivalent analyses of the PMIP3 past1000 simulations (e.g., Schmidt et al., 2011). We estimate and plot the percentiles that correspond to these reference precipitation amounts in other time windows.

For the England-Wales precipitation, the amount that marks the 93.3th percentile in the 1815 CE reference window represents slightly increasing percentiles over time (Figure 7a). In the most recent windows, the series decreases steeply. Similarly, the 50th percentile for 1815 CE represents slightly larger percentiles over time (Figure 8a). On the other hand, there are weak multi-decadal variations in the series for the 6.7th percentile in the observations, and the 6.7th percentile from 1815 CE may become slightly less likely over time (Figure 9a).
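The mapping from reference precipitation amounts to percentiles in other windows can be sketched as follows. The example assumes two-parameter Weibull fits (as described in Appendix C); the shape and scale values, the window years other than 1815, and all function names are hypothetical and serve only to show the mechanics. The levels 6.7% and 93.3% coincide with standard normal values of about −1.5 and +1.5, which presumably motivates their use with the SPI (an assumption here).

```python
import math
from statistics import NormalDist

def weibull_cdf(x, shape, scale):
    """Cumulative probability of amount x under a two-parameter Weibull."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def weibull_quantile(p, shape, scale):
    """Precipitation amount at cumulative probability p."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

# Hypothetical (shape, scale) fits per window centre year; illustrative only.
FITS = {1765: (4.2, 330.0), 1815: (4.0, 300.0), 1865: (4.5, 320.0)}
LEVELS = (0.067, 0.5, 0.933)

def reference_percentiles(fits, reference_year, levels=LEVELS):
    """Percentile that each reference-window amount represents in every window."""
    ref_shape, ref_scale = fits[reference_year]
    amounts = {p: weibull_quantile(p, ref_shape, ref_scale) for p in levels}
    return {year: {p: weibull_cdf(amounts[p], shape, scale) for p in levels}
            for year, (shape, scale) in fits.items()}

series = reference_percentiles(FITS, 1815)
# Cumulative probability of a standard normal value of 1.5 (close to 0.933).
spi_wet = NormalDist().cdf(1.5)
```

By construction, the reference window maps each amount back onto its own level, so any departure from 6.7%, 50%, or 93.3% in another window indicates a shift in that window's fitted distribution.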
Before turning to the reconstructions, we briefly note that the simulations show similar trajectories for all three percentile values and all three regions. There are not any obvious trends, but the series show multidecadal variations. The window centered in the year 1815 CE falls within a minimum or at the end of a minimum. The respective precipitation amount generally represents larger percentiles before the time window centered in 1815 CE. After this time window, the 6.7th and 93.3th percentiles both approach a maximum in the series (Figures 7b and 9b). However, the 93.3th percentiles reach it about the year minimum. Thus, the wet and dry percentiles evolve oppositely from the early 19th century onwards, i.e. the distribution widens and shrinks since approximately the year 1850 CE. The amount of precipitation, which represents median values for the reference year 1815 CE, is representative of larger percentiles in later years (Figure 8b). However, there is a slight decreasing trend from approximately the mid-19th century to the end of the simulation (Figure 8b).

The reconstructions for East Anglia and Southern-Central England have some peculiar features (Figures 7a to 9a). For one, it is not ideal to choose a reference year from the period around 1800 CE. The 6.7th percentile in 1815 CE is much less likely earlier and later in both regions (Figure 9a). Similarly, average precipitation around 1815 CE represents approximately the 20th percentiles in earlier and later periods for East Anglia (Figure 8a) but also represents much smaller percentiles in later periods for Southern-Central England. Severe and extreme wet conditions from this period may even represent long-term average conditions for East Anglia (Figure 7a). We note that comparisons to the data by Rinne et al. (2013) do not feature such peculiarities (not shown) but using a simple scaling approach for the δ¹⁸O data of Young et al. (2015) gives similar results (not shown, but compare information given in the supplementary document).
In general, there are not any clear common evolutions between the different data sets before the 20th century. Only the dry percentiles in the simulation and the observations may evolve similarly in the period of their overlap (Figure 9). Interestingly, there is an apparent contrast between simulation and reconstructions with potentially opposite evolutions in the period of their overlap prior to the 20th century in all shown series. In the 20th century, on the other hand, some commonalities may be inferred at least for the series representing the reference 93.3th percentile (Figure 7).
Most prominent in these analyses is that the distributions for reconstructed precipitation show large shifts to larger precipitation amounts compared to the simulations and observations. In contrast, the simulation and observations vary only within a rather narrow range. This may relate to the weaknesses of the reconstructions in representing not only low frequencies but also extremes (compare Cooper et al., 2013; Wilson et al., 2013). The regional simulation and the reconstructions again show an apparent opposite evolution for East Anglia and Southern-Central England. All sources of information tend to show shifts in the probability of precipitation amounts.

Relation between Temperature and Precipitation in different data sources
We briefly explore the interrelation between the regional temperature and precipitation variability focussing on the extended spring season from March to July. In particular, we show how interannual correlations between the precipitation records and temperature series evolve over time for this season. Figure 10 plots sliding interannual correlations for 51-year windows between the observed and reconstructed precipitation data and the Central England temperature as well as the correlation between simulated England-Wales precipitation and simulated Central England temperature. We plot correlations for the untransformed precipitation records. All records are for the MAMJJ-season. Obviously, the large amount of internal variability on local and regional scales complicates the comparison among different data sources when studying such small regions.
We expect variability of moving correlation coefficients simply due to sampling variability (Gershunov et al., 2001). For example, a bootstrap procedure following Gershunov et al. (2001) suggests a 90% credible interval for 51-year moving window correlations of between approximately −0.59 and approximately −0.21 for a correlation of approximately −0.43 between simulated Central England Temperature and England-Wales precipitation over the full period. That is, variations in Figure 10 are probably within the sampling variability estimates for 51-year moving window correlations. That further implies that for overall uncorrelated data we can expect some windows to show statistically significant correlations. We do not show significance levels in Figure 10 but we note that for 51-year windows and the time series characteristics of the data (e.g., approximately uncorrelated noise for seasonal precipitation), one may regard absolute values of correlation coefficients larger than 0.23 as statistically significant at the 5% level.
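As a rough cross-check of these numbers, the sampling spread of a single 51-year correlation can be approximated analytically. The sketch below uses the Fisher z-transformation rather than the bootstrap procedure of Gershunov et al. (2001); it assumes serially uncorrelated, jointly Gaussian data and a one-sided test for the threshold, and the function names are hypothetical. For the values quoted above it yields an interval close to the bootstrap one and a threshold near 0.23.

```python
import math
from statistics import NormalDist

def fisher_interval(r, n, coverage=0.90):
    """Approximate sampling interval of a correlation estimated from n pairs."""
    z = math.atanh(r)                      # Fisher z-transform
    se = 1.0 / math.sqrt(n - 3)            # standard error in z-space
    q = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)

def one_sided_threshold(n, alpha=0.05):
    """Correlation magnitude exceeded with probability alpha under zero correlation."""
    q = NormalDist().inv_cdf(1.0 - alpha)
    return math.tanh(q / math.sqrt(n - 3))

low, high = fisher_interval(-0.43, 51)     # full-period correlation, 51-year window
r_crit = one_sided_threshold(51)
```

The approximation applies to one window at a time. Neighbouring moving windows share most of their years, so their correlation estimates are not independent of each other.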
On interannual timescales and over 51-year moving windows, all data sets evolve similarly in Figure 10 for the extended spring season. However, observed and reconstructed data show weaker correlations in the late 20th century, while the correlation strength increases in the regional simulation. Neither reconstruction shows any statistically significant relation between temperature and precipitation over the full period. The reconstruction for East Anglia is intermittently negatively correlated with the temperature data. The observations show a notable negative relationship from the second half of the 19th to the mid-20th century. Only correlations between the regional simulation temperature and precipitation are negative and relatively strong (r ≈ −0.5) throughout the full period.
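The sliding correlations discussed here can be computed along the following lines. This is a minimal sketch with hypothetical function names; it uses the plain Pearson correlation on untransformed values and steps the 51-year window forward one year at a time, which is assumed to match the construction of Figure 10.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sliding_correlation(x, y, window=51):
    """Correlation in every consecutive window of the given length."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    return [pearson(x[s:s + window], y[s:s + window])
            for s in range(len(x) - window + 1)]

# Synthetic check: a series and its negated, rescaled copy correlate at -1.
temperature = [math.sin(0.3 * t) for t in range(120)]
precipitation = [-2.0 * v + 1.0 for v in temperature]
correlations = sliding_correlation(temperature, precipitation)
```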
The observed negative relation is well known. For example, Crhová and Holtanová (2018) show a slightly negative correlation between temperature and precipitation in observations over the southern British Isles in spring and summer. They also show that regional climate simulations usually capture this feature successfully.

Here, we briefly describe additional results. If we perform similar analyses as described above but on a selection of the PMIP3 ensemble of global simulations (Schmidt et al., 2011), we do not find commonalities between the simulations or between the simulations and the other sources of information (not shown, see supplementary document). If we use different reconstructions, agreement between simulated and reconstructed precipitation does not necessarily increase, but differences between reconstructions and observations may be reduced (not shown, see supplementary document).

We use two different reconstructions based on δ¹⁸O. For one, we obtain the precipitation reconstruction by Rinne et al. (2013) for Southern England for the May to August extended summer season. Secondly, we use the isotope records for England and Wales by Young et al. (2015) and scale the composite against the observational England-Wales precipitation data. We follow the procedure described by Young et al. (2015) but for two seasonal estimates: the extended spring from March to July used in our main analyses and, following Young et al. (2015), the summer season from June to August.

The supplementary document provides some details for our summer season scaling of the isotope data of Young et al. (2015).
The most striking feature is again a notable difference in the percentiles prior to time windows approximately centered in the year 1850 compared to the later period. This feature resembles the behavior of the tree-ring width based reconstructions. While this may be due to the chosen calibration method and period, it appears more likely that there is a problem in the relationship between isotopes and precipitation for this early period.
Comparing our extended spring season scaling to the equivalent observations, there is limited agreement for the dry percentile after approximately the year 1850 (not shown) but otherwise we cannot find any consistency of this data compared to the observational counterparts. We also see no agreement between the data by Young et al. and the regional simulation output.

The period covered by the data of Rinne et al. (2013) overlaps only briefly with the period of the observational data. For this overlap, dry percentiles tend to agree with the observations but wet percentiles evolve oppositely (compare supplementary document). The change in average precipitation for a reference year also agrees between both data sets for the time of overlap (not shown). Compared to the regional simulation output, evolutions tend to be opposite.
If we consider the relation between temperature and precipitation in the additional data sets and their respective seasons, the disagreement between data sources changes compared to our main analyses (not shown). The observations show consistently negative correlations for the summer season, and the scaled isotope data by Young et al. (2015) agrees quite well with the summer observations except for a large part of the 20th century when it shows a markedly weaker negative correlation (not shown).
The simulation again shows generally stronger correlations compared to the other data in summer and shows some agreement with the observations in the industrial period since approximately the year 1850 (not shown). If we correlate the scaled isotope

The observational England-Wales precipitation data is a weighted composite of regional series based on instrumental information. The information entering the composites and the regional index changed over time. Similarly, the reconstructions combine spatially distributed proxies, e.g., tree-ring width series into regional scale composite series (Cooper et al., 2013; similar effects in removing local variability. In this respect, records from different sources are similar to each other and thus our comparison appears valid. Explicit uncertainty estimates are only available for the reconstruction for East Anglia and only for a low-pass filtered version of the data (Cooper et al., 2013). Our results as well as the discussions of Cooper et al. (2013), Wilson et al. (2013), Rinne et al. (2013), and Young et al. (2015) emphasize that uncertainties for the reconstructions are potentially large and that even the relationship to precipitation is not necessarily valid for some periods. Similarly, uncertainties affect the simulations not only with respect to our domain choice but also with respect to the algorithms and parametrisations implemented for simulating precipitation in the regional climate model.
Considering the limitations of any simulation and the known shortcomings of the reconstructions, questions may arise as to the validity and robustness of our analyses. Even if one assumes that prior discussions on the reconstructions invalidate their use, they would at least be a useful data source for our first goal of highlighting the benefits of adding the SPI to our set of tools for studying past precipitation variability.
However, we do not agree with such an assumption. The reconstructions are still at least 'preliminary' estimates (as stated by Cooper et al., 2013) of past precipitation for the southern British Isles. As such, it is of value to include them in a comparison of distributional precipitation characteristics between different data sources for this domain. It is further of interest to highlight, for any available reconstruction, in which properties the reconstructed precipitation distributions agree or disagree with the other sources of information. That is, understanding our sources of information about past climates requires the identification of their strengths as well as their shortcomings.
More generally, we argue that the transformation to standardized indices provides a sound basis for equivalence between the different precipitation estimates for subsequent comparisons of the distributional properties. Then, we assume that the comparison becomes informative for changes over time between these distributions. While we cannot expect accurate or even approximate temporal agreement between time series from simulation output and observation based data on either interannual or multi-decadal time-scales because of internal variability, the transformation makes our comparison one of climatologies.
Furthermore, one may assume that the evolution of percentiles and variability may be more consistent between the different data sets than the average conditions.
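The transformation to a standardized index referred to above can be sketched as a two-step mapping: a precipitation amount is first converted to its cumulative probability under the distribution fitted for a window and then to the standard normal value with the same cumulative probability. The sketch assumes a Weibull fit with hypothetical parameter values; the function names and the clipping of extreme probabilities are illustrative choices, not the documented implementation.

```python
import math
from statistics import NormalDist

def weibull_cdf(x, shape, scale):
    """Cumulative probability of amount x under a two-parameter Weibull."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def spi_from_amount(amount, shape, scale, eps=1e-12):
    """Standard normal value with the same cumulative probability as the amount."""
    p = weibull_cdf(amount, shape, scale)
    p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
    return NormalDist().inv_cdf(p)

# Hypothetical window fit (shape 4.0, scale 300 mm); illustrative only.
SHAPE, SCALE = 4.0, 300.0
median_amount = SCALE * math.log(2.0) ** (1.0 / SHAPE)
spi_values = [spi_from_amount(a, SHAPE, SCALE) for a in (150.0, 250.0, 350.0)]
spi_median = spi_from_amount(median_amount, SHAPE, SCALE)
```

Because every data source is mapped onto the same standard normal scale within its own window, index values can be compared across sources even when their absolute precipitation amounts differ.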

Implications of the main results
Our analyses highlight the shortcomings of different reconstructions relative to observations. We also see that differences as compared to observations may be comparable for reconstructions and simulations. Our approach further shows that apparently the reconstructions and the simulations occasionally evolve in opposite directions. This may signal that we indeed do not perform a valid comparison, that simulations may misrepresent forced responses, or, considering the relationship between the reconstructions and temperature, that the reconstructions do not fully reflect precipitation.
We expect disagreement between simulations and observations not least because of differing influences of internal variability (see discussions below). More critical is the lack of consistency between reconstructions and observations. Most notably the reconstructions show unrealistically large changes in the cumulative probabilities represented by certain precipitation amounts for the extended spring season MAMJJ (compare Figures 7 to 9). The reconstructions do not reliably represent the extended spring precipitation distributions in specific periods.
One result is the inconsistency of the relationships between temperature and precipitation in the data sets for the considered domains for the extended spring season. Tout (1987) and Crhová and Holtanová (2018) both note the negative relationship between temperature and precipitation observations for Britain. Tout (1987) does not find any changes in the negative relationship between England-Wales precipitation and Central England Temperature for the summer season from June to August between 1766 CE and 1980 CE. We find the negative relationship for the extended spring only consistently in the simulation, and from approximately 1850 CE to 1950 CE also in the observations. The tree-ring width based reconstructions do not show any clear relationship for the extended spring season. The disagreement between data sets changes for other seasons (not shown).
The differences between the simulation and observations may imply either shortcomings of any of the observational data sets in the early period or that the simulation presents a too stable relationship between temperature and precipitation in southern Great Britain. Explanations might be physical inconsistencies within the simulations. More generally, any of the data sources may lack the physical relationship between the temperature and precipitation records in the chosen season. Another possibility is that internal large-scale climate factors influencing the relationship between both parameters evolve differently in the simulation and reality. Assuming that the observations are the more reliable data set, we tend to the inference that the disagreement between observations and reconstructions suggests major shortcomings in the reconstructions.

Internal vs. forced variability
If we expect temporal consistency among the different sources of information, then we are assuming that all the sources of information are responding to the impact of external climate forcing, and that the regional simulation skillfully represents the climate response to these conditions. Nevertheless, internal climate variability may dominate even for large amplitude exogenous forcing (compare, e.g., Deser et al., 2012a). We have to ask, what is our expectation of consistency between simulated and observed responses to exogenous influences?
The instrumental period overlaps with the industrial period of anthropogenic climate forcing. Earlier exogenous forcing is potentially weak despite relatively large variations in solar activity (Clette et al., 2014), and the occurrence of a number of strong tropical volcanic eruptions during the period of interest (e.g., Schmidt et al., 2011).

Forced precipitation signals can agree in simulations, e.g., the CMIP5 21st century global projections (Fischer et al., 2014).
A lack of an identifiable relationship to the forcing between different data sources in our study does not necessarily imply that the underlying climate data are wrong but may simply suggest that internal, e.g., oceanic, atmospheric, or coupled climate variability masks, modulates, or counteracts an external forcing influence. That is, the lack of consistent evolutions points to shortcomings of the data sources or an overwhelming influence of internal variability. We have to emphasize that the regional simulation and its driving MPI-ESM-COSMOS simulation both use variations of the total solar irradiance forcing that could be unrealistically wide, and neither simulation includes a resolved stratosphere to account for potential UV-related top-down mechanisms (Anet et al., 2013, 2014). In addition, our regional focus is close to the western boundary of the domain of the regional simulation, and, thus, we expect a rather strong influence of the dynamical evolution of the driving coarse-resolution simulation with MPI-ESM-COSMOS. Indeed, Blenkinsop and Fowler (2007) report a strong influence of the driving general circulation model on the representation of drought in regional climate simulations in southern Great Britain.
Relatedly, since the regional focus is a small domain, the influence of natural internal variability is likely large, e.g. in the case of the British Isles, variability in the North Atlantic Oscillation (Gómez-Navarro et al., 2012; Gómez-Navarro and Zorita, 2013; Hall and Hanna, 2018; Matthews et al., 2016). Thus, we should not expect simulations to agree with observations on the evolution of regional climate parameters and even an ensemble may show diverse behavior. Differences in internal variability between models, observations, and paleo-observations may include their representation of past changes in the relationship between the regional climate and the large-scale circulation (Pinto and Raible, 2012; Lehner et al., 2012; Raible et al., 2014).

Thus, while the forcing history suggests notable variations, and the large-scale temperature records indicate an imprint of the forcing history on hemispheric and global temperatures, internal variability may dominate on smaller regional scales (e.g., Deser et al., 2012b). This is despite the fact that, e.g., the large scale storm track is indeed sensitive to solar (e.g., Ineson et al., 2015) and volcanic forcing (e.g., Fischer et al., 2007; Trouet et al., 2018). Considering the possibly large role of internal variability on regional scales and the limitations of simulations in representing regional scale precipitation, the occasionally consistent variations in precipitation distribution properties increase our confidence in simulated forced changes. However, while the regional simulation appears to present similar variations compared to the observations during some periods, we cannot say whether it does so for the right reasons.

Conclusions
This study pursued two goals. For one, we wanted to show that the Standardized Precipitation Index (SPI) over moving windows helps in the rigorous comparison of different sources of precipitation information over paleoclimatic time-scales.
The information on precipitation distributions obtained by the SPI-approach eases comparison of how different sources of information represent climatologies of precipitation. Second, by using this approach, we studied the consistency of the various sources of information for precipitation variations in a small regional domain in southern Great Britain.
Regarding the results for our specific study domain, first we did not find any clear consistency for precipitation signals among a regional climate model simulation, an observational data set, and two local domain reconstructions. We conclude that the considered reconstructions appear to be unreliable representations of the observational series.
Second, the regional simulation shows occasional agreement with its observational target, the observational England-Wales precipitation data. In particular the variability in both data sources shows comparable changes for the full period of the observations. This is possibly due to comparable changes in dryness, which also show some level of agreement over the full period.

This partial agreement between variability and dryness of the regional simulation and observations is encouraging. However, considering all associated uncertainties, we cannot conclude that the agreement in properties reflects agreement in the underlying processes in the respective data sources.
Third, the simulation data does not agree with the reconstructions. Nevertheless, an interesting result is the at times opposite evolution of the reconstructions and the regional simulations considering regional dryness and wetness, e.g., between 1750 and 1850. Again, considering all sources of uncertainty, we cannot attribute this to the external forcing or to errors in either data source.
Fourth, our data sources do not agree on the strength of the relationship between temperature and precipitation. However, the relationships between both parameters share some common co-variance on interannual time scales between the sources of information for the season from March to July, e.g. in the 19th century.
Generally, a dominant role of internal variability could explain the lack of consistency in standardised precipitation measures in the different data sets on the temporal and spatial scales we consider here; the relative role of the external climate forcing generally becomes weaker at smaller spatial and shorter temporal scales (Deser et al., 2012b). The lack of general consistency and slightly differing interannual relations between temperature and precipitation still require a closer look at the uncertainties of observations, the methods and input data of reconstructions, and dynamical and thermodynamical representations of regional climate in regional simulations.
A supplementary document for this manuscript will be deposited at https://osf.io/duyqe/.
The England-Wales Precipitation data is available from the Met Office, https://www.metoffice.gov.uk/hadobs/hadukp/ as are the subdivisions for South-East, South-West, and Central England.
If deemed relevant for future work, we are going to provide the standardised data as well via a public repository.
Considering the data used in the supplementary document, we are unable to provide the data by Rinne et al. (2013)

Figure A1, top). It is generally too cold over the Baltic region, the eastern part of the model domain, the southern border of the domain over Africa, and central Europe. High elevation and southern area warm biases frequently exceed 6 K. Cold biases exceed 2 to 4 K occasionally over northeastern Europe and at the southern border of the domain. We attribute these biases to some extent to the cruder representation of the European orography and, possibly related to that, to biases in the modelled atmospheric circulation. However, the specific choice of forcings may also influence the climatology.

In the regional CCLM simulation (Figure A1

For precipitation, summer is frequently too dry in central Europe in COSMOS-MPI-ESM and especially at the west coast of Scotland and in the Alps (Figure A2, top row). The southern domain is generally too dry in spring when Scandinavia is slightly too wet. Coastal and mountainous regions as well as Iberia, Italy, and southern France are more likely to be too dry in autumn and winter. Scandinavia is also too wet in autumn. The COSMOS-MPI-ESM winter climatology is too wet over much of central, eastern, and northern Europe.
In CCLM, too dry conditions are generally confined to southern Europe and North Africa and areas affected by the storm track, i.e. the coasts of Scotland and Norway (Figure A2, bottom row). They extend to southern central Europe only in summer.
The climate is too wet in Scandinavia and northeastern Europe in most seasons. Large parts of Europe are too wet in all seasons except summer. Noteworthy is the excess precipitation at the northern flank of the Alps from autumn to spring. Some of these discrepancies are possibly attributable to a too zonal airflow outside the summer season.

In summary, the model presents a too strong latitudinal temperature gradient over the European domain. The annual cycle of temperature is apparently too strong in the South with warm biases in summer but cold biases in winter and it is slightly too weak in the North with cold biases being stronger in summer than in winter. Similarly to temperature, the gradient in precipitation also appears to be too strong and the annual cycle amplitude differs between simulation and gridded observational estimates especially for Central Europe. Specifically, autumn to spring are wetter in the simulation while summer conditions differ only slightly or are too dry, which implies a weaker annual cycle compared to observations.

Appendix B: Uncertainty of running measures
Figure B1 shows bootstrap estimates over one thousand 40-year samples for each 51-year window. The estimates are for the running measures for reconstructions and observations for the three regions of interest (red) and the regional simulation (blue). The top row shows Weibull standard deviations and the bottom row shows the percentiles.

The figure highlights that sampling variability is generally larger for the simulated data. Indeed, sampling variability may render differences between periods non-significant. However, the bootstrap distributions also appear strongly skewed.
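The resampling behind such estimates can be sketched as follows. The example draws one thousand 40-year subsamples from a single 51-year window and evaluates a statistic on each. Several choices are assumptions made for illustration: the subsamples are drawn without replacement, the statistic is an empirical percentile rather than one derived from a Weibull fit, the window itself is synthetic, and all function names are hypothetical.

```python
import random

def percentile(values, p):
    """Empirical percentile with linear interpolation, p in [0, 1]."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac

def bootstrap_window_statistic(window, statistic, n_samples=1000,
                               sample_size=40, seed=0):
    """Evaluate a statistic on repeated subsamples of one moving window.

    Subsamples are drawn without replacement (an assumption).
    """
    rng = random.Random(seed)
    return [statistic(rng.sample(window, sample_size))
            for _ in range(n_samples)]

# Synthetic 51-year window drawn from a Weibull (scale 300, shape 4).
rng = random.Random(1)
window = [rng.weibullvariate(300.0, 4.0) for _ in range(51)]
estimates = bootstrap_window_statistic(window, lambda s: percentile(s, 0.067))
spread = (percentile(estimates, 0.05), percentile(estimates, 0.95))
```

Repeating this for every window position gives a band around each running measure, against which differences between periods can be judged.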

Appendix C: Distributional parameters
The Weibull distribution is a two parameter distribution with a scale and a shape parameter. See, e.g., Sienz et al. (2012), for more details and how the distribution compares to other distributions in computing the Standardised Precipitation Index.

Figures C1 and C2 present the shape, k, and scale, λ, parameters of our Weibull distribution fits for the reconstructions for East Anglia and Southern-Central England, the observational England-Wales precipitation, and the respective time series in the simulation.
Results for the simulation show very similar evolutions among regions highlighting the homogeneity of the simulation data. There are also similarities between the two reconstructions. One could argue the shape parameters evolve similarly in observation and simulation.
The shape parameter determines the 'shape' of the distribution. In our cases, changes in this parameter are rather small (compare Figure C1). Nevertheless they can result in notably different widths of distributions for a specific data set over time.
It is interesting that there is only small overlap between the range of shape parameters for the East Anglia reconstruction and all other series.
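How the two Weibull parameters affect the width and asymmetry of the distribution can be checked with the standard moment formulas, which involve the gamma function. The sketch below is illustrative only; the function name and the parameter values are hypothetical. It shows that, for a fixed scale, a larger shape parameter narrows the distribution, and that the skewness is negative for shape values above roughly 3.6 and becomes more negative as the shape parameter grows.

```python
import math

def weibull_moments(shape, scale=1.0):
    """Mean, standard deviation, and skewness of a two-parameter Weibull."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    g3 = math.gamma(1.0 + 3.0 / shape)
    mean = scale * g1
    std = scale * math.sqrt(g2 - g1 ** 2)
    # Third central moment divided by the cubed standard deviation;
    # the scale parameter cancels out.
    skew = (g3 - 3.0 * g1 * g2 + 2.0 * g1 ** 3) / (g2 - g1 ** 2) ** 1.5
    return mean, std, skew

# Hypothetical parameter values for illustration.
for k in (2.0, 4.0, 5.0, 8.0):
    mean, std, skew = weibull_moments(k, scale=300.0)
    print(f"shape={k:.1f} mean={mean:.1f} std={std:.1f} skew={skew:.2f}")
```

The scale parameter stretches the distribution without changing its skewness, so changes in asymmetry over time trace back to the shape parameter alone.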

Larger scale parameters for a constant shape parameter result in a flatter distribution that extends further to larger values.
Smaller values result in a narrower distribution with larger probability density at its peak.
The evolution of the shape parameter reflects, in our cases, the evolution of the skewness of the distributions (not shown).
All distributions show negative skewness, and its magnitude grows as the shape parameter increases.

Figure. Extended spring (MAMJJ) precipitation in (paleo-)observation based data and simulation output, a) East Anglia precipitation in reconstruction (black) and regional model (blue), b) Southern-Central England precipitation in reconstructions (black) and regional simulation (blue), and c) England-Wales precipitation in observational data (black) and regional simulation (blue). We show interannual data (light colors) and 51-point Hamming-filtered data (solid colored).

Figure C2. Evolution of the scale parameter λ for the Weibull distribution fits for the a) East Anglia reconstruction, b) Southern-Central England reconstruction, c) England-Wales precipitation observational data, d) East Anglia regional simulation, e) Southern-Central England regional simulation, f) England-Wales precipitation regional simulation.