Significance of uncertain phasing between the onsets of stadial–interstadial transitions in different Greenland ice core proxies

Different paleoclimate proxy records provide evidence of repeated abrupt climate transitions during previous glacial intervals. These transitions are thought to comprise abrupt warming and an increase in local precipitation over Greenland, a sudden reorganization of the Northern Hemisphere atmospheric circulation, and a retreat of sea ice in the North Atlantic. The physical mechanism underlying these so-called Dansgaard–Oeschger (DO) events remains debated. A recent analysis of Greenland ice core proxy records found that transitions in Na+ concentrations and δ18O values are delayed by about one decade with respect to corresponding transitions in Ca2+ concentrations and in the annual layer thickness during DO events. These delays are interpreted as a temporal lag of sea-ice retreat and Greenland warming with respect to a synoptic- and hemispheric-scale atmospheric reorganization at the onset of DO events and may thereby help constrain possible triggering mechanisms for the DO events. However, the explanatory power of these results is limited by the uncertainty of the transition onset detection in noisy proxy records. Here, we extend previous work by testing the significance of the reported lags with respect to the null hypothesis that the proposed transition order is in fact not systematically favored. If the detection uncertainties are averaged out, the temporal delays in the δ18O and Na+ transitions with respect to their counterparts in Ca2+ and the annual layer thickness are indeed pairwise statistically significant. In contrast, under rigorous propagation of uncertainty, three statistical tests cannot provide evidence against the null hypothesis. We thus confirm the previously reported tendency of delayed transitions in the δ18O and Na+ concentration records. Yet, given the uncertainties in the determination of the transition onsets, it cannot be decided whether these tendencies are truly the imprint of a prescribed transition order or whether they are due to chance.
The analyzed set of DO transitions can therefore not serve as evidence for systematic lead–lag relationships between the transitions in the different proxies, which in turn limits the power of the observed tendencies to constrain possible physical causes of the DO events.


Introduction
In view of anthropogenic global warming, concerns have been raised that several subsystems of the Earth's climate system may undergo abrupt and fundamental state transitions if temperatures exceed corresponding critical thresholds (Lenton and Schellnhuber, 2007; Lenton et al., 2008, 2019). Under sustained warming, the Atlantic Meridional Overturning Circulation (AMOC), the Amazon rainforest, and the Greenland ice sheet are, among others, possible candidates to abruptly transition to new equilibrium states that may differ strongly from their current states (Lenton et al., 2008). Understanding the physical mechanisms behind abrupt shifts in climatic subsystems is crucial for assessing the associated risks and for defining safe operating spaces in terms of cumulative greenhouse gas emissions. To date, empirical evidence for abrupt climate transitions comes only from paleoclimate proxy records encoding climate variability in the long-term past. First discovered in the δ18O records from Greenland ice cores, the so-called Dansgaard–Oeschger (DO) events are considered the archetype of past abrupt climate changes (see Fig. 1) (Johnsen et al., 1992; Dansgaard et al., 1993; Bond et al., 1993; Andersen et al., 2004). These events constitute a series of abrupt regional warming transitions that punctuated the last and previous glacial intervals at millennial recurrence periods. Amplitudes of these decadal-scale temperature increases range from 5 °C to 16.5 °C over Greenland (Kindler et al., 2014; Huber et al., 2006; Landais et al., 2005). The abrupt warming is followed by gradual cooling over centuries to millennia, before the climate abruptly transitions back to cold conditions. The relatively cold (warm) intervals within the glacial episodes have been termed Greenland stadials (GS) (Greenland interstadials (GI)). GS typically persist over millennial timescales, before another abrupt warming starts a new cycle (Rasmussen et al., 2014; Ditlevsen et al., 2007).
Although less pronounced, a global impact of DO events on climate and ecosystems is evident in many proxy records (e.g., Moseley et al., 2020; Buizert et al., 2015; Lynch-Stieglitz, 2017; Kim et al., 2012; Fleitmann et al., 2009; Voelker, 2002; Cheng et al., 2013).
Apart from δ18O, other Greenland ice core proxy records, such as Ca2+ and Na+ concentrations as well as the annual layer thickness λ, also bear the signature of DO cycles, as can be seen in Fig. 1 (e.g., Erhardt et al., 2019; Fuhrer et al., 1999; Ruth et al., 2007). While δ18O is interpreted as a qualitative proxy for ice core site temperatures (e.g., Gkinis et al., 2014; Jouzel et al., 1997; Johnsen et al., 2001), changes in Ca2+ concentrations, or equivalently dust, are believed to reflect changes in the atmospheric circulation (Ruth et al., 2007; Erhardt et al., 2019). Na+ concentration records indicate past sea-salt aerosol concentrations and are thought to correlate negatively with North Atlantic sea-ice cover (Erhardt et al., 2019; Schüpbach et al., 2018). The annual layer thickness depends on past accumulation rates at the drilling site and hence indicates local precipitation driven by synoptic circulation patterns (Erhardt et al., 2019). According to this proxy record interpretation, DO events comprise not only sudden warming, but also a sudden increase in local precipitation amounts, a retreat of the North Atlantic sea-ice cover, and changes of hemispheric circulation patterns.
In the search for the mechanism(s) causing or triggering DO events, several attempts have been made to deduce the relative temporal order of these abrupt changes by analyzing the phasing of corresponding abrupt shifts detected in multi-proxy time series from Greenland ice cores (Erhardt et al., 2019; Thomas et al., 2009; Steffensen et al., 2008; Ruth et al., 2007). While Thomas et al. (2009) and Steffensen et al. (2008) report delayed Greenland warming with respect to atmospheric changes for the onsets of GI-8 and of GI-1 and the Holocene, Ruth et al. (2007) find no systematic lead or lag between NGRIP dust concentration and δ18O changes across the onsets of GI-1 to GI-24. However, the comprehensive study conducted by Erhardt et al. (2019) concludes that, on average, initial changes in both terrestrial dust aerosol concentrations (Ca2+) and local precipitation (λ) have preceded the changes in local temperatures (δ18O) and sea-salt aerosol concentrations (Na+) by roughly one decade at the onset of DO events during the last glacial cycle.
Figure 1: δ18O, annual layer thickness λ (cyan), Ca2+ (orange), and Na+ (green) from the NGRIP ice core, together with time series of Ca2+ (red) and Na+ (light green) from the NEEM ice core, on the GICC05 timescale in ky b2k, at 10-year resolution. Light blue vertical lines mark the timings of DO events. The δ18O data and the GICC05 timescale are due to Andersen et al. (2004); Gkinis et al. (2014); Vinther et al. (2006); Andersen et al. (2006); Svensson et al. (2008). All other time series are retrieved from Erhardt et al. (2019); for the DO event timings and Greenland Interstadial (GI) notation we follow Rasmussen et al. (2014).
These observation-based studies are complemented by numerous conceptual theories and modeling studies that explore a variety of mechanisms to explain the DO events. Many studies emphasize the role of the AMOC in the emergence of DO events (Broecker et al., 1985; Clark et al., 2002; Ganopolski and Rahmstorf, 2001; Henry et al., 2016). In this context, Vettoretti and Peltier (2018) identified a self-sustained sea-salt oscillation mechanism to initiate transitions between stadials and interstadials in a comprehensive general circulation model (GCM) run, while Boers et al. (2018) proposed a coupling between sea-ice growth, subsurface warming, and AMOC changes to explain the DO cycles. Moreover, Li and Born (2019) draw attention to the subpolar gyre, a sensitive region that features strong interactions between atmosphere, ocean, and sea ice. [...] Here, we revisit the high-resolution (7 years or higher) multi-proxy time series around 23 major DO events from the latter half of the last glacial cycle, from the NEEM and the NGRIP ice cores (Erhardt et al., 2019). The fact that high-frequency internal climate variability blurs abrupt transitions limits the ability to precisely detect their onsets in the proxy data and thereby constitutes the main obstacle for the statistical analysis of the succession of events. The method designed by Erhardt et al. (2019) conveniently takes this into account: instead of returning scalar estimators, it quantifies the transition onsets in terms of Bayesian posterior probability densities that indicate the plausibility of a transition onset at a certain time in view of the data. This gives rise to a set of uncertain DO transition onset lags for each pair of proxies under study, whose statistical interpretation is the goal of this study. While Erhardt et al.
(2019) report transition onsets, midpoints, and endpoints, we restrict our investigation to the transition onset points, since we consider the leads and lags between the initial changes in the different proxy records to be the relevant quantity for a potential identification of the physical trigger of the DO events. We extend the previous work by interpreting the sets of uncertain lags as samples generated in random experiments from corresponding unknown populations; each proxy pair is associated with its own population of lags. This allows us to investigate whether the reported average lags (Erhardt et al., 2019) are a systematic feature or whether they might have emerged by chance. In order to review the statistical evidence for potential systematic lags, we formalize the notion of a 'systematic lag': we call a lag systematic if it is enshrined in the random experiment in the form of a population mean different from zero. Samples generated from such a population with non-zero mean would systematically, and not by chance, exhibit sample means different from zero. Accordingly, we formulate the null hypothesis that the proposed transition sequence is in fact not physically favoured. In mathematical terms, this corresponds to an underlying population of lags with mean equal to zero or with reversed sign with respect to the observed lags. A rejection of this null hypothesis would statistically corroborate the interpretation that transitions in δ18O and Na+ systematically lag their counterparts in λ and Ca2+. On the other hand, acceptance of the hypothesis would prevent us from ruling out that the observed lag tendencies are a coincidence and not a systematic feature. We have identified three different statistical tests suitable for this task, which all rely on slightly different assumptions. Therefore, in combination they yield a robust assessment of the observations.
Most importantly, we propagate the uncertainties that arise from the transition onset detection to the level of p-values of the different tests.
We will show that, if the uncertainties are averaged out at the level of the individual transition onset lags, thus ignoring the uncertainties in the onset detection, all tests indicate statistical significance (at the 5 % significance level) of the observed tendencies toward delayed δ18O and Na+ transition onsets with respect to the corresponding onsets in λ and Ca2+. Rigorous uncertainty propagation yields, however, substantial probabilities for the observed transition onset lags to be non-significant with respect to the null hypothesis. We thus argue that the uncertainties in the transition onset detection are too large to infer a population mean different from zero in the direction of the observed tendencies. In turn, this prevents the attribution of the observed lead-lag relations to a fundamental mechanism underlying the DO events. We discuss the difference between our approach and the one followed by Erhardt et al. (2019) in detail below.

In addition to the quantitative uncertainty discussed here, there is always qualitative uncertainty about the interpretation of climate proxies. Clearly, there is no one-to-one mapping between proxy variables and the climate variables they are assumed to represent. To give an example, changes in the atmospheric circulation will simultaneously impact the transport efficiency of sea-salt aerosols to Greenland. Schüpbach et al. (2018) discuss in detail the entanglement of transport efficiency changes and source emission changes for aerosol proxies measured in Greenland ice cores. We restrict our analysis to those proxy pairs [...]

This article is structured as follows: First, the data used for the study are described. Second, we introduce our methodology in general terms, in order to facilitate potential adaptation to structurally similar problems. Within this section, we pay special attention to clarifying the differences between the approaches chosen in this study and by Erhardt et al. (2019). This is followed by the presentation of our results, including a comparison to previous results. In the subsequent discussion, we give a statistical interpretation and explain how the two lines of inference lead to different conclusions. The last section summarizes the key conclusions that can be drawn from our analysis.

Data
In conjunction with their study, Erhardt et al. (2019) published 23 highly resolved time series for Ca2+ and Na+ concentrations from the NGRIP and NEEM ice cores for time intervals of 250 to 500 years centered around DO events from the latter half of the last glacial. The data set covers all major interstadial onsets from GI-17.2 to the Holocene, as determined by Rasmussen et al. (2014). The time resolution decreases from 2 to 4 years with increasing depth in the ice cores due to the thinning of the core. In addition, Erhardt et al. (2019) derived the annual layer thickness from the NGRIP aerosol data and published these records likewise for the time intervals described above. Furthermore, continuous 10-year resolution versions of the proxy records were published, which cover the period 60-10 kyr BP, shown in Fig. 1.

Methods
We first briefly review the probabilistic method that we adopted from Erhardt et al. (2019) in order to estimate the transition onset time t_0 of each proxy variable for each DO event comprised in the data (see Fig. 3). The Bayesian method accounts for the uncertainty inherent to the determination of t_0 by returning probability densities ρ_{T_0}(t_0) instead of scalar estimators. From these distributions, corresponding probability distributions for the pairwise time lags between two proxies can be derived for all DO events. Second, a statistical perspective on the series of DO events is established. For a given proxy pair, the set of transition onset lags from the different DO events is treated as a sample of observations from an unknown underlying population. In this very common setup, one would naturally use hypothesis tests to constrain the population. In particular, the question whether any lag tendencies observed in the data are a systematic feature or whether they have instead occurred by chance can be assessed by testing a convenient null hypothesis. However, the particularity that the individual observations that comprise the sample are themselves subject to uncertainty requires a generalization of the hypothesis tests. We propagate the uncertainty of the transition onset timings to the p-values of the tests and hence obtain uncertain p-values in terms of probability densities (see Figs. 4 and 7). While in common hypothesis tests the scalar p-value is compared to a predefined significance level, here we propose two criteria to project the p-value distribution onto the binary decision between acceptance and rejection of the null hypothesis. After this general characterization of the statistical problem, we introduce the tests which we employ for the analysis. Finally, we compare our approach to the one followed by Erhardt et al. (2019).
The key idea is to model the transition as a linear ramp L(t_i) perturbed by red noise ε(t_i), that is, an autoregressive process of first order:

x(t_i) = L(t_i) + ε(t_i) ,    (1)

where the ramp is constant at y_0 before the onset t_0, increases linearly by Δy over the transition duration τ, and is constant at y_0 + Δy thereafter, and where ε(t_i) = α ε(t_{i−1}) + η_i with Gaussian innovations η_i of standard deviation σ.

This model is fully determined by the four ramp parameters {t_0, y_0, τ, Δy}, the amplitude σ, and the autoregressive coefficient α of the AR(1) process. For a given configuration θ of these six parameters, the probability for this stochastic model to exactly reproduce the data D reads

π(D|θ) = ∏_{i=1}^{N} (2πσ²)^{−1/2} exp( −(δ_i − α δ_{i−1})² / (2σ²) ) ,    (2)

where δ_i = x(t_i) − L(t_i) denote the residuals between the linear ramp and the observations and δ_0 = 0. Bayes' theorem immediately yields the posterior probability density for the model parameters π(θ|D) upon introduction of convenient priors π(θ):

π(θ|D) = π(D|θ) π(θ) / π(D) ,    (3)

where the normalization constant π(D) = ∫ π(D|θ) π(θ) dθ is the a priori probability for the observations. Since the parameter space is six-dimensional, Eq. (3) cannot be evaluated explicitly on a grid with reasonably fine spacing. Instead, an MCMC algorithm is used to sample a representative set {θ_1, ..., θ_m} of parameter configurations from the posterior distribution, which approximates the continuous distribution in the sense that for smooth functions f

∫ f(θ) π(θ|D) dθ ≈ (1/m) ∑_{j=1}^{m} f(θ_j) = ∫ f(θ) ρ̂_Θ(θ) dθ ,    (4)

where the notion of a so-called empirical distribution ρ̂_Θ(θ) = (1/m) ∑_{j=1}^{m} δ(θ − θ_j) has been used. The use of the MCMC algorithm further allows one to omit the normalization constant π(D). The number m of individuals comprised in the MCMC sample must be chosen large enough to ensure a good approximation in Eq. (4). The marginal distribution for the parameter t_0 relevant for our study can be obtained by integration over the remaining parameters θ*:

ρ_{T_0}(t_0) = ∫ π(θ|D) dθ* ,    (5)

which reads

ρ̂_{T_0}(t_0) = (1/m) ∑_{j=1}^{m} δ(t_0 − t_{0,j})    (6)

in terms of the empirical density induced by the MCMC sample.
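As an illustration of this estimation scheme, the following Python sketch fits the onset time of a synthetic ramp-plus-AR(1) record with a random-walk Metropolis sampler. It is a strongly simplified stand-in for the full six-parameter method: all parameters except t_0 are held at their true values, the prior is flat, and all numbers are invented rather than taken from the ice-core data.

```python
import numpy as np

rng = np.random.default_rng(0)

def ramp(t, t0, y0, tau, dy):
    # piecewise-linear ramp: y0 before t0, rising by dy over duration tau
    return y0 + dy * np.clip((t - t0) / tau, 0.0, 1.0)

# synthetic proxy record: ramp plus AR(1) red noise (illustrative values)
t = np.arange(-150.0, 150.0)
alpha, sigma = 0.7, 0.4
eps = np.zeros_like(t)
for i in range(1, len(t)):
    eps[i] = alpha * eps[i - 1] + rng.normal(0.0, sigma)
x = ramp(t, t0=10.0, y0=0.0, tau=40.0, dy=3.0) + eps

def log_post(t0):
    # log-posterior of t0 (flat prior, remaining parameters held at truth);
    # the AR(1) likelihood acts on the innovations of the residuals
    d = x - ramp(t, t0, 0.0, 40.0, 3.0)
    innov = d[1:] - alpha * d[:-1]
    return -0.5 * np.sum(innov**2) / sigma**2

# random-walk Metropolis sampler over t0 alone
samples, t0_cur, lp_cur = [], 0.0, log_post(0.0)
for _ in range(5000):
    t0_prop = t0_cur + rng.normal(0.0, 5.0)
    lp_prop = log_post(t0_prop)
    if np.log(rng.random()) < lp_prop - lp_cur:
        t0_cur, lp_cur = t0_prop, lp_prop
    samples.append(t0_cur)
post = np.array(samples[1000:])  # empirical marginal posterior for t0
```

The array `post` plays the role of the empirical density for t_0: its histogram approximates the marginal posterior, and its spread quantifies the onset detection uncertainty.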
Given the probability densities for the transition onsets of two proxy variables p and q at a chosen DO event i, the probability density for the lag Δt_i^{p,q} = t_0^{p,i} − t_0^{q,i} between them reads

ρ_{ΔT_i^{p,q}}(Δt_i^{p,q}) = ∫ ρ_{T_0^{p,i}}(t) ρ_{T_0^{q,i}}(t − Δt_i^{p,q}) dt .    (7)

Here, ΔT_i^{p,q} was chosen to denote the time lag, which inherits the uncertainty from the transition onset detection and must thus mathematically be treated as a random variable; Δt_i^{p,q} denotes a potential value that ΔT_i^{p,q} may assume. The set of probability densities {ρ_{ΔT_i^{p,q}}(Δt_i^{p,q})}_i derived from the different DO events conveniently describes the random vector of uncertain DO onset lag observations ΔT^{p,q} = (ΔT_1^{p,q}, ..., ΔT_n^{p,q}) for the (p, q) proxy pair in the sense that

ρ_{ΔT^{p,q}}(Δt^{p,q}) = ∏_{i=1}^{n} ρ_{ΔT_i^{p,q}}(Δt_i^{p,q}) .    (8)

Note that the entries ΔT_i^{p,q} of the random vector ΔT^{p,q} are independent from each other and follow their individual distributions ρ_{ΔT_i^{p,q}}(Δt_i^{p,q}), such that the joint distribution is given by the product of the individual distributions. A cross-core comparison is not possible, because the relative dating uncertainties between the cores exceed the magnitude of the potential time lags.
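In practice, the lag density of Eq. (7) can be approximated directly from the MCMC output: since the two onset posteriors are independent, differences of independently paired posterior draws sample the convolution. A sketch with Gaussian stand-ins for the two posteriors (means and widths invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-ins for MCMC posterior samples of two onset times (invented numbers)
t0_p = rng.normal(-5.0, 8.0, size=6000)   # e.g. a Ca2+-type onset posterior
t0_q = rng.normal(5.0, 8.0, size=6000)    # e.g. a Na+-type onset posterior

# the posteriors are independent, so the empirical lag density follows from
# differences of independently paired draws (a Monte Carlo convolution)
lag = t0_p - rng.permutation(t0_q)

mean_lag = lag.mean()                 # ≈ -10: proxy p leads proxy q
prob_negative = (lag < 0.0).mean()    # plausibility that p truly leads q
```

Note how a decadal mean lag coexists with a sizeable probability mass on positive lags whenever the posterior widths are comparable to the lag itself.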
For the sake of simplicity, we omit the difference between the posterior density distribution and the empirical posterior density distribution induced by an MCMC sample. It is shown in Appendix A that all methods can be equivalently formulated in terms of the empirical posterior density distribution. The numerical computations themselves have, of course, been carried out with the empirical densities obtained from the MCMC sampler. Appendix B discusses the construction of numerically manageable empirical densities ρ̂_{ΔT^{p,q}}(Δt^{p,q}). Since a substantial reduction of the available MCMC sampled data is required, a control group of alternative realizations of ρ̂_{ΔT^{p,q}}(Δt^{p,q}) is introduced. The high agreement of the results obtained from the control group with the results discussed in the main text confirms the validity of the initial ρ̂_{ΔT^{p,q}}(Δt^{p,q}) construction.

In the following, all probability densities that represent uncertainties originating in the transition onset observation will be referred to as uncertainty distributions or uncertainty densities. This makes it easier to distinguish them from probability distributions that generically characterize random experiments. The random variables described by uncertainty distributions will be termed uncertain variables and will be marked with a hat. Generally, we denote all random (uncertain) variables by capital letters X (X̂), while realizations will be denoted with lower-case letters x (x̂). Furthermore, distributions will always be subscripted with the random variables that they characterize, e.g., ρ_X(x) (ρ_X̂(x̂)). For the sake of readability, we sometimes omit the index p, q when it is clear that a quantity refers to a pair of proxies (p, q).

Statistical setting
Despite their diversity in terms of temperature amplitude, duration, and frequency across the last glacial, the recurring patterns and their common manifestation in different proxies suggest that the DO events follow a common physical mechanism.

If this assumption holds true, this mechanism prescribes a fixed pattern of causes and effects for all DO events, at least on the scale of interactions between climatic subsystems represented by the proxies under study. However, natural variability randomly delays or advances the individual parts of the event chain of the DO mechanism in each single realization, without violating the mechanistic causality. The observed pairwise transition onset lags can thus be regarded as realizations of independent and identically distributed (i.i.d.) random variables generated in a random experiment (Ω, F, P_{ΔT}^{p,q}) on the sample space Ω = ℝ. Here, F is a σ-algebra defined on Ω and may be taken as the Borel algebra. P_{ΔT}^{p,q}, the so-called population, denotes a probability measure with respect to F and fully characterizes the random lag ΔT^{p,q} between the proxies p and q. Importantly, if any of the proxy variables investigated here were to represent a climate variable associated with the DO event trigger, we would expect an advanced initial change in the record of this proxy with respect to other proxies at DO events. In turn, a pronounced delay of a proxy record's transition onset contradicts the assumption that the proxy represents a climate variable associated with the trigger. Therefore, the identification of leads and lags between the transition onsets in the individual proxy time series may help in the search for the trigger of the DO events. Here, we formalize the investigation of systematic lead-lag relationships between the proxy transitions. The random experiment framework allows us to relate a suspected transition sequence to a mean of the generating population P_{ΔT}^{p,q} differing from zero in the corresponding direction. Evidence for the suspected sequence can then be obtained by testing the null hypothesis of a population mean equal to zero or with sign opposed to the suspected lag direction.
If this null hypothesis can be rejected based on the observations, this would constitute a strong hint for a systematic, physical lag, and would hence potentially yield valuable information on the search for the mechanism(s) and trigger(s) of the DO transitions.
According to the data selection by Erhardt et al. (2019), as explained in Sec. 2, for all studied pairs of proxies we compute either 16 or 20 transition lags from the different DO events, which we interpret as samples Δt^{p,q} = (Δt_1^{p,q}, ..., Δt_n^{p,q}) from their respective populations P_{ΔT}^{p,q}. Studying these samples, Erhardt et al. (2019) deduced a decadal-scale delay in the transition onsets in Na+ and δ18O records with respect to their counterparts in Ca2+ and λ. In order to test whether the data provide evidence for this lag being systematic in a statistical sense, the notion of a 'systematic lag' first needs to be formalized mathematically. We argue that a physical process that systematically delays one of the proxy variable transitions with respect to another must, in the random experiment framework, be associated with a population that exhibits a mean different from zero, µ^{p,q} = E(ΔT^{p,q}) ≠ 0.

The outcomes of such a random experiment will systematically exhibit sample means different from zero, in accordance with the population mean. Samples generated from a population with mean equal to zero may nevertheless yield sample means that strongly differ from zero; however, their occurrence is not systematic but rather a coincidence. Given a limited number of observations, hypothesis tests provide a consistent, yet not unambiguous, way to distinguish systematic from random features.
If the mean of the observed sample, u^{p,q}(Δt^{p,q}) = (1/n) ∑_{i=1}^{n} Δt_i^{p,q}, indicates an apparent lag between the proxies p and q, testing whether the sample statistically contradicts a population that favours no lag (µ^{p,q} = 0) or the opposite lag (sign(µ^{p,q}) = −sign(u^{p,q})) can provide evidence at the significance level α for the observed mean lag to be systematic in the sense that sign(µ^{p,q}) = sign(u^{p,q}).
However, as long as the null hypothesis cannot be rejected, the observed average sample lag cannot be regarded as statistical evidence for a systematic lag.
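A quick simulation makes the distinction concrete: even a population with mean zero occasionally produces decadal-scale sample mean lags for samples of the size considered here. The population width (25 yr) and sample size (16) are merely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# 10000 hypothetical experiments, each drawing n = 16 lags from a population
# with mean zero and a multi-decadal spread (illustrative sigma = 25 yr)
sample_means = rng.normal(0.0, 25.0, size=(10000, 16)).mean(axis=1)

# fraction of zero-mean experiments that still show a mean lag of a decade or more
frac_spurious = (sample_means < -10.0).mean()
```

Under these assumptions, roughly one in twenty zero-mean experiments shows an apparent mean lag of a decade or more, which is exactly the kind of coincidence the hypothesis tests are meant to guard against.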
Before we introduce the tests deployed for this study, we discuss the particularity that the individual observations of the i.i.d.

variables that comprise our samples are themselves subject to uncertainty and hence are represented by probability densities instead of scalar values. The common literature on hypothesis tests assumes that an observation of a random variable yields a scalar value. Given a sample of n scalar observations x = (x_1, ..., x_n), the application of hypothesis tests to the sample is in general straightforward and has been discussed abundantly (e.g., Lehmann and Romano, 2006). In short, a test statistic φ denotes the mapping from the space of n-dimensional samples to the space of the test statistic, and φ_x denotes the explicit value of the function when applied to the observed sample x. Subsequently, integration of the so-called null distribution over all values φ′ which under the null hypothesis H_0 are more extreme than the observed φ_x yields the test's p-value. In this study, a hypothesis on the lower limit of a parameter will be tested. In this one-sided, left-tailed application of hypothesis testing, the p-value explicitly reads

p_x = ∫_{−∞}^{φ_x} ρ⁰_Φ(φ′) dφ′ ,

which defines the mapping

p: ℝⁿ → [0, 1], x ↦ p_x .

Analogous expressions may be given for one-sided right-tailed and two-sided tests. The null distribution ρ⁰_Φ(φ′) is the theoretical distribution of the random test statistic Φ = φ(X) under the assumption that the null hypothesis on the population P_X holds true. If p_x is less than a predefined significance level α, the observed sample x is said to contradict the null hypothesis at the significance level α, and the null hypothesis should be rejected.
In contrast to this setting, the DO transition onset lags Δt_i^{p,q} between the proxies p and q, which are thought to have been generated from the population P_{ΔT}^{p,q}, are observed with uncertainty. In our case, the entries in the vector of observations are uncertain variables themselves, which are characterized by the previously introduced uncertainty distributions ρ_{ΔT_i^{p,q}}(Δt_i^{p,q}). Fig. 4(a) illustrates this situation: from an underlying population P_X a sample x = (x_1, ..., x_6) is realized, with x_i denoting the true values of the individual realizations. However, the exact value of x_i cannot be measured precisely due to measurement uncertainties. Instead, an estimator Ŷ_i is introduced together with the uncertainty distribution ρ_{Ŷ_i}(ŷ_i) that expresses the observer's belief about how likely a specific value ŷ_i for the estimator Ŷ_i is to agree with the true value x_i. The Ŷ_i correspond to the ΔT_i^{p,q}. For the x_i there is no direct correspondence in the problem at hand, because this quantity cannot be accessed in practice and is thus not denoted explicitly. We call the vector of estimators Ŷ = (Ŷ_1, ..., Ŷ_n) an uncertain sample in the following. Omitting the (p, q) notation, we denote an uncertain sample of time lags as

ΔT̂ = (ΔT̂_1, ..., ΔT̂_n) .

Note that the uncertainty represented by the uncertain sample originates from the observation process; the sample no longer carries the generic randomness of the population P_{ΔT} it was generated from. The ΔT̂_i are no longer identically, but still independently, distributed.
A simplistic approach to test hypotheses on an uncertain sample would be to average over the uncertainty distributions and subsequently apply the test to the resulting expected sample

E(ΔT̂) = (E(ΔT̂_1), ..., E(ΔT̂_n)) .

Averaging out uncertainties, however, essentially implies that the uncertainties are ignored and is thus always associated with a loss of information. The need for a more thorough treatment, with proper propagation of the uncertainties, may be illustrated by a simple consideration. Assume that a random variable X can be observed at a given precision σ_obs, where σ_obs may be interpreted as the typical width of the corresponding uncertainty distribution. For any finite number of observations of X, the measurement uncertainty limits the ability to infer properties of the population. For example, one cannot distinguish between potential candidates µ_0 and µ_1 for the population mean whose difference Δµ = |µ_0 − µ_1| is much smaller than the observational precision, unless the number of observations tends to infinity. If uncertainties are averaged out, testing H_0: µ = µ_0 against the alternative H_1: µ = µ_1 can eventually still yield significance, even in the case where |µ_0 − µ_1| ≪ σ_obs. For relatively small sample sizes, such an attested significance would be statistically meaningless. The uncertainties in the individual measurements considered here are on the order of a few decades, while the proposed size of the investigated time lag is roughly one decade.
In combination with the relatively small sample sizes of either 16 or 20 events, the scales involved in the analysis require a suitable treatment of the measurement uncertainties.
The uncertainty propagation relies on the fact that applying a function f: ℝ → ℝ to a real-valued random (uncertain) variable X yields a new random (uncertain) variable G = f(X), which is distributed according to

ρ_G(g) = ∫ δ(g − f(x)) ρ_X(x) dx .

Analogously, the uncertain test statistic Φ̂ = φ(ΔT̂) follows the distribution

ρ_Φ̂(φ̂) = ∫ δ(φ̂ − φ(Δt)) ρ_ΔT̂(Δt) dΔt ,

and the uncertain p-value P̂ = p(ΔT̂) is distributed accordingly. In the example shown in Fig. 4, the initial uncertainties in the observations translate into an uncertain p-value that features both a probability for significance and a probability for non-significance. This illustrates the need for a criterion to project the uncertain p-value onto a binary decision space comprised of rejection and acceptance of the null hypothesis. We propose to consider the following criteria to facilitate an informed decision:

- The hypothesis shall be rejected at the significance level α if and only if the expected p-value is less than α, that is,

  E(P̂) = ∫₀¹ p̂ ρ_P̂(p̂) dp̂ < α .
- The hypothesis shall be rejected at the significance level α if and only if the probability for the p-value to be less than α is greater than a predefined threshold η (we propose η = 90 %), that is,

  π(P̂ < α) = ∫₀^α ρ_P̂(p̂) dp̂ > η .

While the p-value of a certain sample indicates its extremeness with respect to the null distribution, the expected p-value may be regarded as a measure of the uncertain sample's extremeness. Given the measurement uncertainty, the quantity π(P̂ < α) constitutes an informed assessment of how likely or plausible the true value of the measured sample is to be statistically significant with respect to the null hypothesis. Thus, the first criterion assesses how 'strongly' the uncertain sample contradicts the null hypothesis, while the second criterion evaluates the likelihood that the uncertain sample contradicts the null hypothesis.
The choice of η is arbitrary and may be tailored to the specific circumstances of the test. In some situations, one might want to reject the null hypothesis only in the presence of a high probability for significance and therefore choose a large value for η, e.g. when mistakenly attested significance is associated with high costs. In other situations, even small probabilities for significance may be important, e.g. when failing to act on a truly significant result would be associated with high costs or high risks, which calls for a smaller η. We propose to assess the two criteria in combination. In case they do not yield the same result, the weighting between the criteria must be adapted to the statistical problem at hand.
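As a minimal illustration (our sketch in Python, not the original analysis code; all names are hypothetical), both criteria can be evaluated directly on an empirical p-value distribution such as the one obtained by propagating an MCMC ensemble through a test:

```python
import numpy as np

def decide(p_values, alpha=0.05, eta=0.90):
    """Evaluate both decision criteria on an empirical p-value distribution."""
    p_values = np.asarray(p_values, dtype=float)
    expected_p = p_values.mean()                  # E(P) = integral of p * rho(p)
    prob_significant = (p_values < alpha).mean()  # pi(P < alpha)
    return {
        "expected_p": expected_p,
        "prob_significant": prob_significant,
        "reject_by_expectation": expected_p < alpha,   # criterion 1
        "reject_by_probability": prob_significant > eta,  # criterion 2
    }

# toy p-value distribution straddling the 5 % threshold
rng = np.random.default_rng(0)
result = decide(rng.beta(0.5, 4.0, size=6000))
```

Note that for a distribution concentrated well below (or above) α the two criteria agree; they can disagree for distributions that straddle the threshold, which is exactly the case the weighting discussed above is meant to address.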

Hypothesis tests
We have introduced the notion of uncertain samples and its consequences for the application of hypothesis tests. Here, we shortly introduce the tests used to test our null hypothesis that the observed tendency for delayed transition onsets in Na+ and δ18O with respect to Ca2+ and λ has occurred by chance, and that the corresponding populations P_∆T^{p,q} that characterize the pairwise random lags ∆T^{p,q} do in fact not favour the tentative transition orders apparent from the observations. Mathematically, this can be formulated as follows: consider the proxy variables p and q and let the observations ∆t^{p,q} suggest a delayed transition of the proxy q, that is, the corresponding uncertainty distributions ρ_{∆T_i^{p,q}}(∆t_i^{p,q}) indicate high probabilities for negative ∆T_i^{p,q} across the sample according to Eq. 7. We then test the hypothesis H_0: 'The mean value µ^{p,q} = ∫ ∆t ρ_{∆T}^{p,q}(∆t) d∆t of the population P_∆T^{p,q} is greater than or equal to zero.' We identified three tests that are suited for this task, namely the t-test, the Wilcoxon signed-rank (WSR) test, and a bootstrap test. The WSR and the t-test are typically formulated in terms of paired observations {x_i, y_i} that give rise to a sample of differences {d_i = x_i − y_i}, which here correspond to the time lags {∆t_i^{p,q}} of the different DO events (e.g., Rice, 2007; Lehmann and Romano, 2006). The null distributions of the tests rely on slightly different assumptions regarding the populations. Since we cannot guarantee the compliance of these assumptions, we apply the tests in combination to obtain a robust assessment.

t-test
The t-test (Student, 1908) relies on the assumption that the population of differences P_D is normally distributed with mean µ and standard deviation σ. For a random sample D = (D_1, ..., D_n) the test statistic

z(D) = u(D) / (s(D)/√n),

with u(D) and s(D) denoting the sample mean and sample standard deviation, follows a t-distribution with n − 1 degrees of freedom under the null hypothesis µ = 0. The one-sided p-value of an observed sample is given by the cumulative probability that this null distribution assigns to values smaller than the observed z. The resulting p-value must then be compared to the predefined significance level α.
The t-test can be generalized for application to an uncertain sample of the form ∆T = (∆T_1, ..., ∆T_n) as follows: let ρ_∆T(∆t) denote the uncertainty distribution of ∆T. Then, according to Eq. 15, the distribution of the uncertain statistic Ẑ = z(∆T) reads

ρ_Ẑ(z) = ∫ δ(z − z(∆t)) ρ_∆T(∆t) d∆t.

Finally, the distribution of the uncertain p-value may again be computed according to Eq. 15,

ρ_P̂(p) = ∫ δ(p − p_z(z)) ρ_Ẑ(z) dz,

with p_z(z) denoting the one-sided p-value associated with the statistic z under the t-distribution with n − 1 degrees of freedom, and then be evaluated according to the two criteria formulated above.
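The generalization can be sketched as follows (our illustration, assuming the MCMC output is arranged as an m × n array of lag realizations; scipy is used for the t-distribution):

```python
import numpy as np
from scipy import stats

def uncertain_t_test(lag_ensemble):
    """One-sided t-test (H0: mu >= 0) applied to every MCMC realization.

    lag_ensemble: array of shape (m, n), i.e. m realizations of a sample
    of n DO-event lags.  Returns the m left-handed p-values, which form
    the empirical distribution of the uncertain p-value.
    """
    m, n = lag_ensemble.shape
    u = lag_ensemble.mean(axis=1)            # sample means u_j
    s = lag_ensemble.std(axis=1, ddof=1)     # sample standard deviations s_j
    z = u / (s / np.sqrt(n))                 # t statistics z_j
    return stats.t.cdf(z, df=n - 1)          # p-values p_z(z_j)

# hypothetical ensemble: 16 events, mean lag -10 yr, wide onset uncertainty
rng = np.random.default_rng(1)
p_dist = uncertain_t_test(rng.normal(-10.0, 30.0, size=(6000, 16)))
```

The returned array can then be fed to the decision criteria of Sec. 3.2; the expected p-value is simply its mean.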

Wilcoxon-signed-rank
Compared to the t-test, the WSR test (Wilcoxon, 1945) relaxes the assumption of normality imposed on the generating population P_D and replaces it by the weaker assumption of symmetry with respect to its mean µ, in order to test the null hypothesis H_0: µ ≥ 0. The test statistic W for this test is defined as

W = Σ_{i=1}^{n} Θ(D_i) R(|D_i|),

where R(|D_i|) denotes the rank of |D_i| within the sorted set of the absolute values of differences {|D_i|}. The Heaviside function Θ(D_i) guarantees that exclusively positive D_i are summed. The derivation of the null distribution is a purely combinatorial problem and its explicit form can be found in lookup tables. Because W takes only integer values in [0, n(n+1)/2], we denote the null distribution by P_W^0(w) to signal that this is not a continuous density. Explicitly, the null distribution can be derived as follows: first, the assumption of symmetry around zero (for the hypothesis H_0: µ ≥ 0 the relevant null distribution builds on µ = 0) guarantees that the chance for D_i to be positive is equal to 1/2. Hence, the number of positive outcomes m follows a symmetric binomial distribution π(m) = (n choose m) (1/2)^n. For m positive observations, there are (n choose m) different sets of ranks {r_1, ..., r_m} that they may assume, which are, again due to the symmetry of P_D, equally likely. Hence, for a given number of positive outcomes m, the probability to obtain a test statistic w is given by the share of those (n choose m) configurations that yield a rank sum equal to w.

Summing these probabilities over all possible values of m yields the null distribution for the test statistic w.
For a given sample d we test the hypothesis H_0: µ ≥ 0 by computing the corresponding one-sided p-value p_w, which is given by the cumulative probability that the null distribution assigns to values smaller than or equal to the observed w(d):

p_w = Σ_{w=0}^{w(d)} P_W^0(w).     (25)

Since W takes only integer values in [0, n(n+1)/2], it follows that P_w assumes only discrete values in [0, 1], with the null distribution determining the mapping between these two sets.
The generalization of the WSR test to the uncertain sample ∆T can be carried out almost analogously to the t-test. However, the fact that W takes only integer values in [0, n(n+1)/2] makes it inconvenient to use a continuous probability density. We denote the distribution of the uncertain Ŵ(∆T) by

P_Ŵ(w) = ∫ δ_{w, w(∆t)} ρ_∆T(∆t) d∆t,

where δ_{·,·} denotes the Kronecker delta.

Given the one-to-one map from all w in [0, n(n+1)/2] to the set of discrete potential values p_w for P_w in [0, 1] determined by Eq. 25, the probability to obtain p̂_w is already given by the probability to obtain the corresponding ŵ. Hence, we find

P_P̂(p_w(ŵ)) = P_Ŵ(ŵ).
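In practice the discrete generalization again amounts to applying the test to every MCMC realization of the lag sample; a sketch (our code, with scipy providing the exact signed-rank null distribution for small n without ties):

```python
import numpy as np
from scipy import stats

def uncertain_wsr_test(lag_ensemble):
    """One-sided Wilcoxon signed-rank test applied to every MCMC realization.

    Each row of lag_ensemble is one realization of the n event lags; the
    set of resulting one-sided ('less') p-values forms the empirical
    distribution of the uncertain p-value.
    """
    return np.array([stats.wilcoxon(row, alternative="less").pvalue
                     for row in lag_ensemble])

# hypothetical ensemble of 200 realizations of 16 event lags
rng = np.random.default_rng(2)
p_dist = uncertain_wsr_test(rng.normal(-10.0, 30.0, size=(200, 16)))
```

Because the statistic is integer-valued, the resulting p-values occupy a discrete grid, mirroring the discrete distribution P_Ŵ above.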

Bootstrap test
Given an observed sample of differences d = (d_1, ..., d_n), a bootstrap test constitutes a third option to test the compatibility of the sample with the hypothesis that the population of differences features a mean equal to or greater than zero: H_0: µ ≥ 0.
Guidance for the construction of a bootstrap hypothesis test can be found in Lehmann and Romano (2006) and Hall and Wilson (1991). The advantage of the bootstrap test lies in its independence from assumptions regarding the distributions' shape. Lehmann and Romano (2006) propose the test statistic

v(d) = u(d),

with u(d) = (1/n) Σ_{i=1}^{n} d_i denoting the sample mean. In contrast to the above two tests, the bootstrap test constructs the null distribution directly from the observed data. In the absence of assumptions, the best available approximation of the population P_D is given by the empirical density

ρ̄_D(x) = (1/n) Σ_{i=1}^{n} δ(x − d_i).

However, the empirical density does not necessarily comply with the null hypothesis and it thus has to be shifted accordingly:

ρ̄_D^0(x) = (1/n) Σ_{i=1}^{n} δ(x − (d_i − u(d))).

Resampling n values with replacement from this shifted density m times yields bootstrap samples d̃_j and corresponding values ṽ_j = v(d̃_j) of the test statistic, which together form the null distribution. Setting m = 10000 we obtain robust null distributions for the two cases relevant here (n = 16 and n = 20). The p-value of this bootstrap test is then computed as before in a one-sided manner,

p_v = (1/m) Σ_{j=1}^{m} Θ(v(d) − ṽ_j),

where the right-hand side equals the fraction of resampled ṽ_j that are smaller than v(d) of the original sample.

In the case where the sample of differences is uncertain, as for ∆T = (∆T_1, ..., ∆T_n), the construction scheme for ρ_V^0 needs to be adjusted to reflect these uncertainties. In principle, each possible value ∆t for the uncertain ∆T is associated with its own null distribution ρ_V^0(v, ∆t). In this sense, the value of the test statistic v(∆t) should be compared to the corresponding ρ_V^0(v, ∆t) to derive a p-value for this ∆t. Eqs. 31 and 32 define a mapping from ∆t to its corresponding p-value. To compute the uncertainty distribution for the p-value, this map has to be evaluated for all potential ∆t, weighted by the uncertainty distribution ρ_∆T(∆t):

ρ_P̂(p) = ∫ δ(p − p_v(∆t)) ρ_∆T(∆t) d∆t.

The three tests are applied in combination in order to compensate for their individual deficits. If the population P_∆T were truly Gaussian, the t-test would be the most powerful test, i.e., its rejection region would be the largest across all tests on the population mean (Lehmann and Romano, 2006). Since normality of P_∆T cannot be guaranteed, the less powerful Wilcoxon signed-rank test constitutes a meaningful supplement to the t-test, relying on the somewhat weaker assumption that P_∆T is symmetric around zero. Finally, the bootstrap test is non-parametric and, in view of its independence from any assumptions, adds a valuable contribution.
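The bootstrap scheme, including one null distribution per MCMC realization, can be sketched as follows (our reimplementation of the construction described above; names are ours):

```python
import numpy as np

def bootstrap_p_value(d, m=10000, seed=0):
    """Bootstrap test of H0: mu >= 0 for a single observed sample d.

    The empirical density of the differences is shifted to zero mean to
    comply with the null hypothesis, resampled m times, and the p-value
    is the fraction of resampled statistics smaller than the observed
    sample mean.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    v_obs = d.mean()                                   # test statistic v(d)
    shifted = d - v_obs                                # enforce H0: mean zero
    v_null = rng.choice(shifted, size=(m, d.size), replace=True).mean(axis=1)
    return float(np.mean(v_null < v_obs))

def uncertain_bootstrap_test(lag_ensemble, m=2000):
    """Each MCMC realization gets its own null distribution and p-value."""
    return np.array([bootstrap_p_value(row, m=m, seed=j)
                     for j, row in enumerate(lag_ensemble)])
```

Note that, unlike for the t-test and the WSR test, the null distribution here is rebuilt for every realization ∆t_j, which is why the uncertain version loops over rows rather than vectorizing a single null distribution.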

Comparison to previous analysis
For the derivation of the transition lag uncertainty distributions ρ_{∆T_i^{p,q}}(∆t_i^{p,q}) of the i-th DO event between the proxies p and q we follow Erhardt et al. (2019). For the joint interpretation of all events, however, Erhardt et al. (2019) use a 'combined estimate', defined by the normalized product of the individual lag distributions:

ρ_CE(∆t*) ∝ Π_{i=1}^{n} ρ_∆T_i(∆t*).     (34)

This implicitly assumes that all DO events share the exact same time lag ∆t* between the variables p and q. This is realized by inserting a single argument ∆t* into the different distributions ρ_∆T_i(·). Hence, the product on the right-hand side of Eq. 34 in fact indicates the probability that all DO events assume the time lag ∆t*, provided that they all assume the same lag:

ρ_CE(∆t*) = Π_i ρ_∆T_i(∆t*) / ∫ Π_i ρ_∆T_i(∆t) d∆t.     (35)

The denominator on the right-hand side equals the probability that all DO events share a common time lag. Eq. 34 strongly emphasizes those regions where all uncertainty distributions ρ_∆T_i(∆t_i) are simultaneously substantially larger than zero. The 'combined estimate' answers the question: provided that all DO events exhibit the same lag between the transition onsets of p and q, how likely is it that this lag is given by ∆t*? Drawing on this quantity, Erhardt et al. (2019) conclude that δ18O and Na+ 'on average' lag Ca2+ and λ by about one decade.
Thinking of the DO transition onset lags as i.i.d. random variables of a repeatedly executed random experiment takes into account the natural variability between different DO events and hence removes the restrictive a priori assumption ∆t_1 = ... = ∆t_n. In our approach we have related the potentially systematic character of the lags to the population mean. Since the sample mean is the best point estimate of a population mean, we consider it to reasonably indicate potential leads and lags, whose significance should be tested in a second step. Thus, we ascribe to the sample mean a role similar to that which Erhardt et al. (2019) ascribe to the 'combined estimate' and therefore present a comparison of these two quantities in Sec. 4.1.
The mean of an uncertain sample Û = u(∆T) is again an uncertain quantity, and its distribution reads

ρ_Û(u) = ∫ δ(u − u(∆t)) ρ_∆T(∆t) d∆t.

While the 'combined estimate' multiplies the distributions ρ_∆T_i(∆t*), the uncertain sample mean convolves them pairwise (see Appendix C). We thus expect the distributions for uncertain sample means to be broader than the corresponding distributions for the 'combined estimate'. This can be motivated by the simple example of two Gaussian variables X and Y with variances σ_x² and σ_y². According to the convolution, their sample mean U = (X + Y)/2 is normally distributed with variance σ²_{x∗y} = (σ_x² + σ_y²)/4. In contrast, a combined estimate would yield a normal distribution with variance σ²_{xy} = σ_x² σ_y² / (σ_x² + σ_y²). Thus, the convolution appears broader whenever (σ_x² + σ_y²)² > 4 σ_x² σ_y², i.e., whenever σ_x ≠ σ_y, which is the case for the distributions considered in this study.
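The variance comparison for the two-Gaussian example can be checked numerically; a minimal sketch (the product-of-Gaussians variance is the standard closed-form result, the convolution is additionally verified by Monte Carlo):

```python
import numpy as np

def mean_variance(sx2, sy2):
    # convolution: variance of the sample mean U = (X + Y) / 2
    return (sx2 + sy2) / 4.0

def combined_estimate_variance(sx2, sy2):
    # variance of the normalized product of the two Gaussian densities
    return sx2 * sy2 / (sx2 + sy2)

# Monte Carlo check of the convolution result for sx2 = 1, sy2 = 9
rng = np.random.default_rng(0)
u = (rng.normal(0.0, 1.0, 10**6) + rng.normal(0.0, 3.0, 10**6)) / 2.0
```

For equal widths both constructions coincide; for unequal widths the sample-mean (convolution) distribution is strictly broader, in line with the inequality stated above.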

Results
In the following we apply the above methodology to the different pairs of proxies investigated by Erhardt et al. (2019). We first study the uncertain sample means. As already mentioned, the sample mean is the best available point estimate for the population mean. Hence, sample means different from zero may be regarded as first indications of potential systematic lead-lag relationships and thus motivate the application of hypothesis tests. We compare the results obtained for the uncertain sample means with corresponding results for the 'combined estimate'. Both quantities indicate a tendency towards a delayed transition in Na+ and δ18O. Accordingly, in the subsequent section we apply the generalized hypothesis tests introduced above to the uncertain samples of transition lags to test the null hypothesis that, pairwise, the apparent transition sequence is not systematically favoured, that is, that the populations have mean equal to or greater than zero.

With the sample mean being the best point estimator of the population mean, it serves as a suitable indicator for a potential population mean different from zero. The expectations of the sample means of all proxy pairs do in fact suggest a tendency towards negative values in all distributions, i.e., a delay of the Na+ and δ18O transition onsets with respect to Ca2+ and λ. This indication is weakest for (Ca2+, Na+) and (Ca2+, δ18O) from NGRIP, since for these pairs we find a non-negligible probability for a positive sample mean. For the other pairs the indication is comparably strong, with the 95th percentiles of the uncertainty distributions for the sample mean still being less than zero.
Overall, the results for the uncertain mean confirm the previously reported tendencies, and in very rough terms the distributions qualitatively agree with those for the 'combined estimate'. In agreement with the heuristic example from Sec. 3.4, we find the sample mean distributions to be broader than the 'combined estimate' distributions in all cases. The expected sample means indicate less pronounced lags for (Ca2+, Na+) (Fig. 5(a)) and (Ca2+, δ18O) (Fig. 5(c)) from the NGRIP ice core compared to the expectations of the corresponding 'combined estimate'. In combination with the broadening of the distributions, this yields considerable probabilities for U > 0 of 12% and 14%, respectively, which would correspond to a delayed transition of Ca2+ with respect to Na+ or δ18O. Contrarily, for (λ, Na+) (NGRIP, Fig. 5) the indication of a delayed Na+ transition is more pronounced. This motivates testing whether the observations significantly contradict the hypothesis of a population mean equal to or greater than zero. Accordingly, the subsequent section presents the results obtained from the application of three different hypothesis tests that target the population mean. As discussed in Sec. 3, the tests have been modified to allow for a rigorous uncertainty propagation and return uncertainty distributions for their corresponding p-values, rather than scalars.

Statistical significance of the proposed lead-lag relations
Above, we identified three tests for testing the hypothesis that the samples ∆T^{p,q} were actually generated from populations that on average feature no or even reversed time lags compared to what the sign of the corresponding uncertain sample mean suggests. Mathematically, this is equivalent to testing the hypothesis that the mean µ^{p,q} of the population P_∆T^{p,q} is greater than or equal to zero: H_0: µ^{p,q} ≥ 0. A rejection of this hypothesis would confirm that the assessed sample is very unlikely to stem from a population with µ^{p,q} ≥ 0 and would thereby provide evidence for a systematic lag. Under the constraints indicated above, this would in turn yield evidence for an actual lead of the corresponding climatic process. We have chosen the typical significance level of α = 0.05. Fig. 7 summarizes the final uncertainty distributions of the three tests for all proxy pairs under study; corresponding values are given in Tab. 1. Fig. 6 exemplarily illustrates the application of the three tests to the empirical densities obtained for ∆T(Ca2+, Na+) (NGRIP). In Fig. 6 the initial uncertainty in the observations, i.e., the uncertainty encoded by the distributions of transition onset lags, is propagated to an uncertain test statistic according to Eq. 16. In turn, the uncertain test statistic yields an uncertain p-value (see Eq. 17). Since the numerical computation is based on empirical densities as generated by the MCMC sampling, we show the corresponding histograms instead of continuous densities; for the ρ_∆T_i(∆t_i), Gaussian kernel density estimates are presented only for the sake of visual clarity. On the level of the test statistics, the red dashed line separates the acceptance from the rejection region, based on the null distributions given in black. Qualitatively, the three tests yield the same results. The histograms clearly indicate non-zero probabilities for the test statistic in both regions.
Correspondingly, the histograms for the p-values stretch well across the significance threshold. The shapes of the histograms resemble an exponential decay towards higher p-values, which results from the non-linear mapping of the test statistics to the p-values. Despite the pronounced bulk of empirical p-values below the significance level, the probability for non-significant p-values is still well above 50% for all three tests (see Tab. 1). Also, the expected p-value exceeds the significance level for all tests. Hence, neither of the two criteria for rejecting the null hypothesis formulated in Sec. 3.2 is met for the proxy pair (Ca2+, Na+). In contrast, if the observational uncertainties are averaged out on the level of the transition onset lags, all tests yield p-values below the significance level, which would indicate that the lags were indeed significant. Hence, the rigorous propagation of uncertainties qualitatively changes the statistical assessment of the uncertain sample of lags ∆T(Ca2+, Na+) (NGRIP): while the expected sample rejects the null hypothesis, rigorous uncertainty propagation leads to acceptance. This holds true for all tests. Fig. 7 summarizes the results obtained for all proxy pairs under study. Qualitatively, our findings for the other pairs are the same as for the (Ca2+, Na+) (NGRIP) case discussed in detail above. All expected p-values, as indicated by the pink bars, are above the significance level. Also, the probability for significance is below 60% for all pairs and all tests, as shown by the pie charts. Therefore, for all proxy pairs and for all tests, the formulated decision criteria do not allow rejection of the null hypothesis of a population mean greater than or equal to zero. In contrast, all expected samples are significant across all tests, with corresponding p-values indicated by the yellow bars.
The proxy pairs with the lowest expected p-values and the highest probability for P̂ < α are (λ, Na+) from NGRIP and (Ca2+, Na+) from NEEM, as already suggested by the analysis of the uncertain sample mean. For the NGRIP ice core, the delay of Na+ and δ18O with respect to Ca2+ has a very low probability to be significant (approximately one third). The pair (λ, δ18O) ranges in between the latter two.

Table 1. Results from the application of the t-test, the WSR test and a bootstrap test to uncertain samples of DO transition onset lags ∆T^{p,q}. E(P̂) denotes the expected p-value, derived from the uncertainty-propagated p-value distribution. The probability for a significant test result associated with the same distribution is indicated by π(P̂ < 0.05). For comparison, the p-values from the application of the tests to the expected sample E(∆T) = ∫ ρ_∆T(∆t) ∆t d∆t are given in the bottom row.
Based on the 'combined estimate', Erhardt et al. (2019) report a lead of the Ca2+ and λ transition onsets with respect to Na+ and δ18O by approximately one decade, with the 90% confidence interval ranging from 0 to approximately 15 years. The 'combined estimate' implicitly assumes that for a given proxy pair all DO events share a common time lag (∆T_i^{p,q} = ∆T_j^{p,q}). We argue that the variability across different DO events cannot be ignored in the assessment of the data. Although the DO events are likely to be caused by the same physical mechanism, changing boundary conditions and other natural climate fluctuations will lead to deviations in the exact timings of the different processes involved in triggering the individual DO events. Fig. 2 clearly shows that the different events exhibit different time lags. Even if the DO events were driven by the same process, physically they constitute different realizations, and they exhibit great variability also in other variables such as the amplitude of the temperature change (Kindler et al., 2014) or the waiting times with respect to the previous event (Ditlevsen et al., 2007; Boers et al., 2018). The random-experiment framework introduced in this study relaxes the constraint of a common time lag ∆t* shared across all events and reflects the fact that natural variability will cause different expressions of the same mechanism across different DO events. Moreover, this framework relates potential systematic leads and lags in the physical process that drives DO events to a corresponding non-zero mean of a population of lags between proxy variables. This allows for the physically meaningful formulation of a statistical hypothesis and a corresponding null hypothesis. By applying different hypothesis tests we have followed a well-established line of statistical inference. Motivated by the apparent transition order, confirmed here on the level of uncertain sample means, we tested the null hypothesis that the corresponding populations do not favor the proposed transition sequence.
Rejection of this hypothesis would have provided evidence that the observed lag tendency is an imprint of the underlying physical process and therefore a systematic feature. However, generalized versions of three different hypothesis tests consistently fail to reject the null hypothesis under rigorous propagation of the observational uncertainties originating from the MCMC-based transition onset detection. This holds true for all proxy pairs. The fact that the tests rely on different assumptions on the population's shape but nonetheless qualitatively yield the same results makes our assessment robust. We conclude that the possibility that the observed tendencies towards advanced transitions in Ca2+ and λ have occurred simply by chance cannot be ruled out. If the common physical interpretation of the studied proxies holds true, our results imply that the hypothesis that the trigger of the DO events is associated directly with the North Atlantic sea-ice cover rather than the atmospheric circulation, be it on synoptic or hemispheric scale, cannot be ruled out. We emphasize that our results should not be misunderstood as evidence against the alternative hypothesis of a systematic lag. In the presence of a systematic lag (µ < 0), the ability of hypothesis tests to reject the null hypothesis of no systematic lag (H_0: µ = 0) depends on the sample size n, on the ratio between the mean lag |µ| and the variance of the population, and on the precision of the measurement. None of these quantities is favourable in our case, and thus it is certainly possible that the null hypothesis cannot be rejected despite the alternative being true.
Our main purpose was the consistent treatment of observational uncertainties, and we have largely ignored the vibrant debate on the qualitative interpretation of the proxies. Surprisingly, we could not find any literature on the application of hypothesis tests to uncertain samples of the kind discussed here. The theory of fuzzy p-values is concerned with uncertainties either in the data or in the hypothesis; however, it is not applicable to measurement uncertainties that are quantifiable in terms of probability density functions (Filzmoser and Viertl, 2004). We have proposed to propagate the uncertainties to the level of the p-values and to then consider the expected p-value and the share of p-values that indicate significance in order to decide between rejection and acceptance. The p-value measures the extremeness of a sample with respect to the null distribution, and we hence regard the expected p-value as a suitable measure of the uncertain sample's extremeness. In cases of high cost of a wrongly rejected null hypothesis, one might want to have a high degree of certainty that the uncertain sample actually contradicts the null hypothesis and hence a high probability for the uncertain p-value to be smaller than α. In contrast, if the observational uncertainties are averaged out beforehand, crucial information is lost: the expected sample may either be significant or not, but the uncertainty about the significance can no longer be accurately quantified.
The potential of the availability of data from different sites has probably not been fully leveraged in this study. Naively, one could think of the NEEM and NGRIP (Ca2+, Na+) lag records as two independent observations of the same entity. However, given the discrepancies in the corresponding sample mean uncertainty distributions, changes in sea-ice cover and atmospheric circulation could in fact have impacted the two cores differently. Proxy-enabled modeling studies as presented by Sime et al. (2019) could shed further light on the similarity of the signals at NEEM and NGRIP as a function of changing climatic conditions. Also, a comparison of the NGRIP and NEEM records on an individual event level could provide further insights into how to combine these records statistically. There might be ways to further exploit the advantage of having two recordings of the same signal.
We have presented a statistical reinterpretation of the high-resolution proxy records provided and analyzed by Erhardt et al. (2019). The probabilistic transition onset detection, also designed by Erhardt et al. (2019), very conveniently quantifies the uncertainty in the transition onset estimation by returning probability densities instead of scalar estimates. While the statistical quantities 'combined estimate' (Erhardt et al., 2019) and 'uncertain sample mean' (this study) indicate a tendency for a pairwise delayed transition onset in Na+ and δ18O proxy values with respect to Ca2+ and λ, a more rigorous treatment of the involved uncertainties shows that these tendencies are not statistically significant. That is, at the significance level α = 5% they do not contradict the null hypothesis that no transition sequence, or the reversed one, is in fact physically favoured. Thus, a pairwise systematic lead-lag relation cannot be evidenced for any of the proxies studied here. We have shown that if uncertainties on the level of transition onset lags are averaged out beforehand, the samples of lags indeed appear to be significant, which underpins the importance of rigorous uncertainty propagation in the analysis of paleoclimate proxy data. We have focused on the quantitative uncertainties and have largely ignored qualitative uncertainty stemming from the climatic interpretation of the proxies. However, if the common proxy interpretations hold true, our findings suggest that, for example, the hypothesis of an atmospheric trigger, either of hemispheric or synoptic scale, for the DO events should not be favoured over the hypothesis that a change in the North Atlantic sea-ice cover initiates the DO events.
Even though we find that the uncertainty of the transition onset detection combined with the small sample size prevents the deduction of statistically unambiguous statements on the temporal order of events, we think that multi-proxy analysis is a promising approach to investigate the sequential order at the beginning of DO events. In this study, we refrained from analyzing the lags between the different proxies in a combined approach and focused on the marginal populations. However, a combined statistical evaluation, that is, treating the transition onsets of all proxy variables as a four-dimensional random variable, merits further investigation. Also, we propose to statistically combine measurements of the same proxy pairs from NEEM and NGRIP (and potentially further ice cores). Finally, hierarchical models may be invoked to avoid switching from a Bayesian perspective in the transition onset estimation to a frequentist perspective in the statistical interpretation of the uncertain samples.
Finally, efforts in conducting modelling studies should be sustained. Proxy-enabled modeling in particular bears the potential to improve comparability between model results and paleoclimate records. Together, these lines of research are promising to further constrain the sequence of events that caused the abrupt climate changes associated with DO events.
Code and data availability. The 10-year resolution time series of Na+ and Ca2+ concentrations and δ18O values from the NGRIP ice core shown in Fig. 1 are retrieved from PANGAEA (Erhardt et al., 2018, https://doi.org/10.1594. The high-resolution Na+ and Ca2+ concentration time series centered around DO transitions, which were used to derive the time lags between the transition onsets of the two proxies, can be found in the same PANGAEA archive. The code used to generate the empirical densities of transition onsets is available at https://github.com/terhardt/DO-progression (last access: 21 October 2020). The code used to carry out the statistical analysis of the sample of empirical transition onset distributions is available from the authors upon request and will be published once the manuscript is accepted.
The TiPES (Tipping Points in the Earth System) project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 820970. NB acknowledges funding by the Volkswagen Foundation.
However, a classical numerical representation of this density on a discretized grid is inconvenient: due to its high dimensionality, for any reasonable grid spacing the number of data points easily overloads the computational power of ordinary computers. For example, representing each of the six dimensions with a minimum of 100 points would amount to a total of 10^12 data points. On top of that, the application of any method to such a grid is computationally very costly. Here, the MCMC sampler constitutes an efficient solution: by sampling a representative set {θ_j}_j from the posterior probability density, it may be used to construct an empirical density in the sense of Eq. 4. For the sake of simplicity, in the main text we have formulated the methods in terms of continuous probability densities, although all computations in fact rely on empirical densities obtained from MCMC samples. Here, we show that all steps in the derivation of the methods can be performed equivalently under stringent use of the empirical density.

790
With regard to hypothesis tests, the use of empirical densities for the uncertain transition lag samples ∆T_i^{p,q} essentially boils down to an application of the tests to each individual value comprised in the respective empirical density.
For a given proxy and a given DO event, in a first step the MCMC algorithm samples from the joint posterior probability density of the model's parameter configuration θ = (t_0, τ, y_0, ∆y, α, σ), giving rise to the empirical density

ρ̄_Θ(θ) = (1/m) Σ_{j=1}^{m} δ(θ − θ_j).

Integration over the nuisance parameters then yields the marginal empirical density for the transition onset,

ρ̄_{T_0}^{p,i}(t_0) = (1/m) Σ_{j=1}^{m} δ(t_0 − t_{0,j}^{p,i}),

where the index i indicates the DO event and p denotes the proxy variable, while j runs over the MCMC sampled values. We use bars to mark empirical densities in contrast to continuous densities. The uncertainty distribution for the lag ∆T_i^{p,q} between the variables p and q as defined by Eq. 7 may then be approximated as follows (omitting the index i):

ρ̄_{∆T^{p,q}}(∆t) = (1/m²) Σ_{k=1}^{m} Σ_{l=1}^{m} δ(∆t − (t_{0,k}^{p} − t_{0,l}^{q})).

Thus, the empirical uncertainty distribution for the time lag is induced by the set of all possible differences between members of the two MCMC samples for the respective transition onsets. For this study, m = 6000 values have been sampled with the MCMC algorithm for each transition under study. This yields m² = 36 · 10^6 potential values for the empirical ∆T uncertainty distribution. To keep the computation efficient, the sets of lags were restricted to combinations k = l and thus to 6000 empirical values. We thus approximate

ρ̄_{∆T^{p,q}}(∆t) ≈ (1/m) Σ_{j=1}^{m} δ(∆t − (t_{0,j}^{p} − t_{0,j}^{q})).

This drastic reduction of values certainly requires justification, which we give later by comparing final results of the analysis to those obtained from control runs. The control runs analogously construct the empirical densities for the transition onset lags from 6000 out of the 36 · 10^6 possible values, but use randomly shuffled versions of the original sets of transition onset times for the variables p and q:

ρ̄_{∆T^{p,q}}(∆t) = (1/m) Σ_{j=1}^{m} δ(∆t − (t_{0,s(j)}^{p} − t_{0,s′(j)}^{q})).

Here, s and s′ denote randomly chosen permutations of the set {1, 2, ..., m}.
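The reduction and its control run can be sketched as follows (our illustration; the sign convention for the lag is an assumption consistent with negative values indicating a delayed proxy q):

```python
import numpy as np

def lag_density(onsets_p, onsets_q, control=False, seed=0):
    """Reduced empirical lag density for one DO event (illustrative sketch).

    onsets_p, onsets_q: m MCMC-sampled transition onsets of proxies p and q.
    Instead of all m**2 differences, only pairs with equal MCMC index are
    formed; a control run shuffles the indices first to check that this
    reduction does not introduce spurious structure.
    """
    p = np.asarray(onsets_p, dtype=float)
    q = np.asarray(onsets_q, dtype=float)
    if control:
        rng = np.random.default_rng(seed)
        p, q = rng.permutation(p), rng.permutation(q)
    return p - q  # lags dt_j = t0_j^p - t0_j^q (sign convention assumed)
```

Since a permutation leaves the marginal onset distributions unchanged, the control run reproduces, e.g., the mean of the lag density exactly, while broader features of the two variants can be compared to validate the reduction.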
As in the main text, in the following we denote uncertain quantities with a hat. For a given proxy pair, however, the starting point for the statistical analysis is the uncertain sample ∆T^{p,q} = (∆T^{p,q}_1, ..., ∆T^{p,q}_n), which is characterized by the n-dimensional uncertainty distribution

\rho_{\Delta T^{p,q}}(\Delta t^{p,q}) = \prod_{i=1}^{n} \rho_{\Delta T^{p,q}_i}(\Delta t^{p,q}_i).

Its empirical counterpart is given by

\bar{\rho}_{\Delta T^{p,q}}(\Delta t^{p,q}) = \prod_{i=1}^{n} \bar{\rho}_{\Delta T^{p,q}_i}(\Delta t^{p,q}_i).

This empirical density comprises m^n possible values for the n-dimensional random vector ∆T^{p,q}, and again a substantial reduction of the representing set is required for practical computation. Defining the reduced empirical density for ∆T^{p,q} as

\bar{\rho}_{\Delta T^{p,q}}(\Delta t^{p,q}) = \frac{1}{m} \sum_{j=1}^{m} \delta(\Delta t^{p,q} - \Delta t^{p,q}_j)

constrains the set that determines \bar{\rho}_{\Delta T^{p,q}}(\Delta t^{p,q}) to m values, where values from different DO events with the same MCMC index j are combined:

\Delta t^{p,q}_j = (\Delta t^{p,q}_{1,j}, \ldots, \Delta t^{p,q}_{n,j}).
Again, the validity of this reduction is checked in the control runs by randomly permuting the sets {∆t^{p,q}_{i,j}}_j for the individual DO events with respect to the index j before the set reduction.
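The stacking of the event-specific lag samples into m lag vectors, and the per-event shuffling used in the control runs, might look as follows (a sketch on synthetic per-event lag samples; names and values are illustrative, with n = 16 DO events):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6000, 16  # MCMC draws per event, number of DO events

# Synthetic per-event lag samples dt[i, j] standing in for the empirical
# densities of the event-specific lags (hypothetical values, in years).
event_means = rng.normal(8.0, 4.0, size=n)
dt = event_means[:, None] + rng.normal(0.0, 6.0, size=(n, m))

# Reduced empirical density: combine values from different DO events
# sharing the same MCMC index j, giving m vectors dt_j of dimension n.
lag_vectors = dt.T  # shape (m, n)

# Control run: permute each event's sample independently before
# stacking, so that the dt_{i,j} recombine between events.
dt_shuffled = np.stack([row[rng.permutation(m)] for row in dt])
lag_vectors_ctrl = dt_shuffled.T

print(lag_vectors.shape, lag_vectors_ctrl.shape)  # (6000, 16) (6000, 16)
```

The shuffling leaves each event's marginal lag distribution untouched and only breaks the (arbitrary) coupling through the MCMC index, which is why the control runs can serve as a validity check.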
Having found a numerically manageable expression for the empirical uncertainty distribution of the sample ∆T^{p,q}, it remains to be shown how the hypothesis tests can be formulated on this basis. If {∆t_j}_j denotes the set of n-dimensional vectors forming the empirical uncertainty distribution for the sample of lags obtained from n DO events, then the naive intuition holds true: the corresponding set {φ_j = φ(∆t_j)}_j represents the empirical uncertainty distribution of the test statistic, and correspondingly {p_φ(φ_j)}_j characterizes the uncertain p-value. In the following, we derive this relation for the t-test as an example; the derivations for the WSR and the bootstrap tests are analogous.
The empirical uncertainty distribution for a sample ∆T induces a joint uncertainty distribution for the sample's mean and standard deviation,

\hat{U} = \frac{1}{n} \sum_{i=1}^{n} \Delta T_i, \qquad \hat{S}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(\Delta T_i - \hat{U}\bigr)^2.

The empirical uncertainty distribution for (Û, Ŝ) can then be written as

\bar{\rho}_{\hat{U},\hat{S}}(\hat{u},\hat{s}) = \frac{1}{m} \sum_{j=1}^{m} \delta\bigl((\hat{u},\hat{s}) - (u_j, s_j)\bigr).

The (u_j, s_j) that form the empirical uncertainty distribution are simply the means and standard deviations of those ∆t_j = (∆t_{1,j}, ∆t_{2,j}, ..., ∆t_{n,j}) that form the vector-valued empirical uncertainty distribution for ∆T. From \bar{\rho}_{\hat{U},\hat{S}}(\hat{u},\hat{s}), the empirical uncertainty distribution for the uncertain test statistic Ẑ can be computed as follows:

\bar{\rho}_{\hat{Z}}(\hat{z}) = \frac{1}{m} \sum_{j=1}^{m} \delta(\hat{z} - z_j), \qquad z_j = \frac{u_j}{s_j / \sqrt{n}}.

This shows that, for a given empirical uncertainty distribution \bar{\rho}_{\Delta T}(\Delta t) = \frac{1}{m} \sum_{j=1}^{m} \delta(\Delta t - \Delta t_j) for a sample of time lags, the corresponding distribution for the test statistic Ẑ = z(∆T) is formed by the set {z(∆t_j) | j ∈ [1, m]}, where each ∆t_j is a vector in n dimensions. The uncertain (left-sided) p-value remains to be derived from \bar{\rho}_{\hat{Z}}(\hat{z}):

\bar{\rho}_{\hat{P}}(\hat{p}) = \frac{1}{m} \sum_{j=1}^{m} \delta\bigl(\hat{p} - p_z(z_j)\bigr).

Finally, the practical computation of the uncertain p-values boils down to an application of the test to all members of the set {∆t_j} that originates from the MCMC sampling used to approximate the posterior probability density for the ramp parameter configuration Θ. For the WSR test, the corresponding expression can be derived analogously. The bootstrap test bears the particularity that each ∆t_j induces its own null distribution. Yet, the application of the test to each individual ∆t_j yields a set of p_{v,j} = p_v(∆t_j) that determines the empirical density

\bar{\rho}_{\hat{P}_v}(\hat{p}_v) = \frac{1}{m} \sum_{j=1}^{m} \delta(\hat{p}_v - p_{v,j}).

As explained in Sect. A, we drastically reduce the cardinality of the sets that form the empirical densities \bar{\rho}_{\Delta T^{p,q}}(\Delta t^{p,q}) at two points in the analysis. First, for the representation of the uncertain time lag ∆T^{p,q}_i between the proxies p and q at a given DO event, only 6000 out of the 6000² possible values are used. Second, the set of vectors considered in the representation of \bar{\rho}_{\Delta T^{p,q}}(\Delta t) = \frac{1}{6000} \sum_{j=1}^{6000} \delta(\Delta t^{p,q} - \Delta t^{p,q}_j) comprises only 6000 out of the 6000¹⁶ theoretically available vectors.
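The member-wise application of the t-test can be sketched as follows (synthetic lag vectors stand in for the MCMC output; for simplicity the one-sided p-values are computed from a Gaussian approximation to Student's t distribution, whose exact cdf is not in the Python standard library, so all numbers are purely illustrative):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
m, n = 6000, 16  # MCMC realizations, number of DO events

# Synthetic m x n set of lag vectors dt_j (one row per MCMC index j).
dt = rng.normal(8.0, 6.0, size=(m, n))

# One test per MCMC index j: z_j = u_j / (s_j / sqrt(n)).
u = dt.mean(axis=1)
s = dt.std(axis=1, ddof=1)
z = u / (s / np.sqrt(n))

# Empirical distribution of the uncertain one-sided p-value (Gaussian
# approximation; the analysis proper uses Student's t with n-1 dof).
p = np.array([1.0 - NormalDist().cdf(zj) for zj in z])

# Summary quantities: mean p-value and probability that the uncertain
# p-value falls below the 5% significance level.
print(round(p.mean(), 3), round((p < 0.05).mean(), 3))
```

Each row of `dt` yields one test statistic and one p-value, so the m-element arrays `z` and `p` are exactly the sets that represent the empirical uncertainty distributions of Ẑ and P̂.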
To cross-check the robustness of the results obtained within the limits of this approximation, we applied our analysis to a control group of 9 alternative realizations of the empirical uncertainty density for ∆T^{p,q} for each proxy pair. The control group uncertainty densities are constructed as follows: first, the empirical uncertainty distributions for the event-specific lags ∆T^{p,q}_i are obtained via Eq. (A6). In a second step, the joint empirical uncertainty distribution for ∆T^{p,q} is constructed from randomly shuffled empirical sets {∆t^{p,q}_{i,s_i(j)}} of each DO event:

\bar{\rho}^{\mathrm{ctrl}}_{\Delta T^{p,q}}(\Delta t^{p,q}) = \frac{1}{m} \sum_{j=1}^{m} \prod_{i=1}^{n} \delta\bigl(\Delta t^{p,q}_i - \Delta t^{p,q}_{i,s_i(j)}\bigr). \qquad (B1)

Here, s_i denotes an event-specific permutation of the index set {1, ..., 6000}. Thus, the empirical ∆t^{p,q}_{i,j} recombine between events and give rise to a new set of 6000 vectors that constitute 6000 empirical realizations of the uncertain ∆T^{p,q}.
The results obtained from the control runs show only minor deviations from the results presented in the main text and thus confirm the validity of the reduction of the corresponding sets. Tab. B1 summarizes the results obtained by application of the hypothesis tests to the control group.

Appendix C: Computation of the uncertain sample mean

In the main text, we stated that the distribution of the uncertain sample mean is given by the pairwise convolution of the individual uncertainty distributions that describe the uncertain sample members. Here, we show how the uncertain sample mean can be computed if the individual uncertainty distributions are known.
Consider n random variables which are independently, yet not identically, distributed,

\Delta T_i \sim \rho_{\Delta T_i}(\Delta t_i), \quad i = 1, \ldots, n,

in analogy to the ∆T^{p,q} = (∆T^{p,q}_1, ..., ∆T^{p,q}_n) with ∆T^{p,q}_i ∼ ρ_{∆T^{p,q}_i}(∆t^{p,q}_i) from the main text. Further, let

U = \frac{1}{n} \sum_{i=1}^{n} \Delta T_i

denote the mean of the sample of random variables, which is in turn a random variable itself. In order to compute the distribution ρ_U(u) du, we introduce the variable V = nU and the sequence of variables

V_k = V_{k-1} + \Delta T_k = \sum_{i=1}^{k} \Delta T_i, \quad k = 1, \ldots, n, \qquad (C4)

such that V_n = V. From Eq. (C4) it follows that

\rho_{V_k}(v_k) = \int \rho_{V_{k-1}}(v_{k-1}) \, \rho_{\Delta T_k}(v_k - v_{k-1}) \, \mathrm{d}v_{k-1} = \bigl(\rho_{V_{k-1}} * \rho_{\Delta T_k}\bigr)(v_k)

and hence

\rho_{V_n}(v_n) = \bigl(\rho_{\Delta T_1} * \rho_{\Delta T_2} * \cdots * \rho_{\Delta T_n}\bigr)(v_n),

where v_0 = 0. With V_n/n = U, the distribution for the uncertain sample mean reads

\rho_{V_n}(v_n) \, \mathrm{d}v_n = \rho_{V_n}(nu) \, n \, \mathrm{d}u = \rho_U(u) \, \mathrm{d}u \qquad (C8)

and thus

\rho_U(u) = n \, \rho_{V_n}(nu) = n \, \bigl(\rho_{\Delta T_1} * \cdots * \rho_{\Delta T_n}\bigr)(nu).

Table B1. Results obtained from the application of the hypothesis tests to the control group. Reported are the mean p-values E(p(∆T)), together with the probability of the uncertain p-value being smaller than the significance level, π(p(∆T) < 0.05), and the p-values of the expected samples, p(E(∆T)), for all three tests. All results were derived from the corresponding empirical densities \bar{\rho}_{\Delta T^{p,q}}(\Delta t^{p,q}). The column sub-labels z, w, and bs indicate results obtained from the t-test, the WSR test, and the bootstrap test, respectively. The results presented in the main text are given by the p–q–0 run for each proxy variable.

proxies | run | E(P) | π(P < 0.05) | p(E(∆T))
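Numerically, the sequential convolution and the final change of variables can be sketched on a discretized grid (a minimal sketch with hypothetical Gaussian lag densities; grid, means, and widths are illustrative choices, not values from the study):

```python
import numpy as np

# Hypothetical individual lag densities, discretized on a common grid.
dx = 0.01
x = np.arange(-30.0, 30.0, dx)

def density(mu, sigma):
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return g / (g.sum() * dx)  # normalize on the grid

rhos = [density(5.0, 2.0), density(8.0, 3.0), density(2.0, 1.5)]
n = len(rhos)

# Sequential convolution: rho_{V_k} = rho_{V_{k-1}} * rho_{dT_k}.
rho_v = rhos[0]
v0 = x[0]
for rho in rhos[1:]:
    rho_v = np.convolve(rho_v, rho) * dx  # discrete convolution, scaled by dx
    v0 += x[0]                            # support starts at sum of starts
v = v0 + dx * np.arange(len(rho_v))

# Change of variables U = V_n / n: rho_U(u) = n * rho_{V_n}(n u).
u = v / n
rho_u = n * rho_v
du = dx / n

print(round(np.sum(rho_u) * du, 3))      # normalization, ~1.0
print(round(np.sum(u * rho_u) * du, 3))  # E[U] = (5 + 8 + 2) / 3
```

Note the factor dx on the discrete convolution, which turns `np.convolve` into an approximation of the continuous convolution integral; without it the result would not stay normalized.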