A statistical approach to the phasing of atmospheric reorganization and sea ice retreat at the onset of Dansgaard-Oeschger events under rigorous treatment of uncertainties

For previous glacial intervals, concomitant shifts in different proxy records from Greenland ice cores indicate repeated abrupt climate transitions comprising, among others, abrupt warming, a sudden reorganization of the atmospheric circulation, and a retreat of perennial sea ice. The physical mechanism underlying these so-called Dansgaard-Oeschger (DO) events remains debated. Former studies have attempted to deduce the progression of temperature, circulation, and sea-ice changes at the onset of DO events from paleoclimate proxy records in order to constrain potential triggering mechanisms. In this context, recent research reports systematically delayed transitions in Na concentrations and δ18O values compared to Ca concentrations and the annual layer thickness, by about one decade. This has been interpreted as a temporal lag of sea ice retreat and Greenland warming with respect to the atmospheric reorganization at the onset of DO events. Here, we present a comprehensive statistical analysis of the relative phasing of DO transitions in Ca and Na concentration records from the NGRIP ice core for the period 60-10 kyr BP. Regarding the time lags identified in this period as a sample generated from an unknown population, we derive probability density functions for the sample and population mean and test the null hypothesis of a simultaneous transition. Special attention is paid to the uncertainties inherent in the transition onset detection in noisy data. Their rigorous propagation changes the test results from significant to non-significant, and therefore a purely stochastic origin of the observed tendency for Ca to lead the transition cannot be ruled out. In fact, we show that the data very likely comprise both DO events that were led by a Ca transition and events that were led by a Na transition.
Together, these findings clearly contradict a systematic lead or lag between the DO transitions in the two proxies, and the apparent Ca lead should therefore not be interpreted as an indication of a causal relationship. Under the assumption that all DO events followed the same physical mechanism and that the proxy interpretation holds true, we conclude that at DO transition onsets, neither was the atmospheric reorganization caused by sea ice retreat, nor was the sea ice retreat triggered by atmospheric reorganization.

https://doi.org/10.5194/cp-2020-136 Preprint. Discussion started: 10 November 2020. © Author(s) 2020. CC BY 4.0 License.

Figure 1 (caption, beginning truncated): ... at 10-year resolution. Light blue vertical lines mark the timings of DO events. Data retrieved from Erhardt et al. (2018); DO event timings are according to Rasmussen et al. (2014).

Introduction
In view of anthropogenic global warming, concerns have been raised that several subsystems of the Earth's climate system may undergo abrupt and fundamental changes of state if temperatures exceed corresponding critical thresholds (Lenton and Schellnhuber, 2007; Lenton et al., 2008, 2019). Under sustained warming, the Atlantic Meridional Overturning Circulation (AMOC), the Amazon rainforest, and the Greenland ice sheet are possible candidates, among others, to abruptly transition to new equilibrium states that may differ strongly from their current states (Lenton et al., 2008). Understanding the physical mechanisms behind abrupt shifts in climatic subsystems is crucial for assessing the associated risks and for defining safe operating spaces in terms of cumulative greenhouse gas emissions. To date, empirical evidence for abrupt climate transitions comes only from paleoclimate proxy records encoding climate variability in the long-term past. First discovered in the δ18O records from Greenland ice cores, the so-called Dansgaard-Oeschger (DO) events are considered the archetype of past abrupt climate changes (see Fig. 1) (Johnsen et al., 1992; Dansgaard et al., 1993; Bond et al., 1993; Andersen et al., 2004). These events constitute a series of abrupt regional warming transitions that punctuated the last and previous glacial intervals at millennial recurrence periods. Amplitudes of the decadal-scale temperature increases range from 5 °C to 16.5 °C over Greenland (Kindler et al., 2014; Huber et al., 2006; Landais et al., 2005). The abrupt warming is followed by gradual cooling over centuries or even a few millennia, before the climate abruptly transitions back to cold conditions. The relatively cold (warm) intervals within the glacial episodes have been termed Greenland stadials (GS) (Greenland interstadials (GI)). GS typically persist over millennial time scales, before another abrupt warming starts a new cycle (Rasmussen et al., 2014).
Despite being less pronounced, a global impact of DO events on climate and ecosystems is evident in manifold proxy records (e.g., Moseley et al., 2020; Buizert et al., 2015; Lynch-Stieglitz, 2017; Kim et al., 2012; Fleitmann et al., 2009; Voelker, 2002; Cheng et al., 2013). Apart from δ18O, [...]

Table 1. List of DO events (Greenland Interstadial onsets) for which high-resolution Ca2+ and Na+ concentration data for windows centered at the transition are available. Bold print indicates those events for which the stochastic, MCMC-based method successfully detected empirical density distributions for the transition onset. ∆t indicates the expected time lag between the Ca2+ and the Na+ transition onsets. P(∆t > 0) gives the probability for events to exhibit a positive Ca2+-Na+ lag, i.e., a sodium lead at the transition onset. Dates are according to Rasmussen et al. (2014).

The timing of any abrupt change in the data can be determined only with limited precision (see Fig. 2(a)). Besides these quantitative uncertainties, the interpretation of proxy data is always subject to qualitative uncertainty, since there is no exact mapping from the proxy variable to the climate variable that the proxy is supposed to represent.
We follow Erhardt et al. (2019) and use a Bayesian Markov chain Monte Carlo (MCMC)-based algorithm to infer posterior probability distributions for the DO transition onset time t0 in the set of Ca2+ and Na+ time series. This approach captures the uncertainties in the transition onset determination most conveniently and allows for the derivation of a stochastic sample of Ca2+-Na+ lags. Given this stochastic sample, we propose a method to estimate the mean of an underlying population that is assumed to have generated the small observed sample of lags. Additionally, we use standard tests to assess the significance of leads or lags between Ca2+ and Na+ transition onsets. The data comprise intervals of 250 to 500 years centered around DO events from the period 60-10 kyr BP. The data set covers all interstadial onsets from GI-17.2 to the Holocene, as determined by Rasmussen et al. (2014) (see Tab. 1). The time resolution decreases from 2 to 4 years with increasing depth in the NGRIP ice core due to the compression of the core. With Ca2+ deposition rates reflecting the state of the atmospheric circulation and Na+ concentrations being related to the sea ice extent in the North Atlantic, the comparison of these two proxies from the same archive gives insights into the sequence of events leading to DO events (Erhardt et al., 2019, and references therein). Exact co-registration of the ion concentrations guarantees the absence of any relative dating uncertainties that would impede the approach. For details on the measurement process we refer to Erhardt et al. (2019).

Methods
Here, we present a generic yet detailed description of the methods we use to infer characteristics of a generating population from a sample that is subject to uncertainty. First, we introduce the stochastic transition onset detection algorithm, which by design returns an uncertain transition onset t0 in the form of a posterior probability density distribution. Application of this algorithm to the DO events present in the data from Sec. 2 yields an uncertain sample of transition onset lags ∆t = (∆t_1, ..., ∆t_16) between the calcium and the sodium transitions: the starting point for our statistical analysis. Second, a statistical perspective on the series of DO events is presented, with special emphasis on the particularity that two levels of randomness are involved. This is followed by some remarks on the numerical treatment of high-dimensional probability densities. Subsequently, inference of the population mean and the application of hypothesis tests are discussed under rigorous propagation of the sample's uncertainties.
We highlight that situations similar to ours, where a small-to-medium-size sample generated from an underlying population is subject to uncertainty, may occur in many different disciplines, and the content of this section may be adapted to such

Figure 2. (a) Posterior probability distributions ρ(t0) for the onset of the Ca2+ and Na+ transitions associated with the onset of GI-12c, derived from Ca2+ (orange) and Na+ (green) values around the GI-12c onset at 2-year resolution, using the probabilistic ramp fitting shown in (b). The black line in (b) indicates the mean of the ramps sampled in the MCMC process and the gray shaded area indicates the 5th-95th percentiles of these ramps. (c) Histogram associated with the empirical density ρ^emp(∆t) of transition onset lags ∆t between the two proxies (violet), together with the corresponding Gaussian kernel density estimate (blue).
problems. For example, consider a drug study that is meant to test the effect of a medicine against fever. The difference in the body temperature of the subjects before and after the treatment can, however, be measured only with finite precision. Hence, the sample of temperature differences is uncertain in the very same sense as the sample of time lags between the DO transition onsets of different proxies. In most practical cases the relative sample uncertainty may be small and can correspondingly be neglected.
Also, if the uncertainty has the same shape for every sample member, hierarchical models can typically be applied. It is the individuality of each sample member's uncertainty that prevents the application of hierarchical methods in our case.

Transition onset detection
Given intervals of Ca2+ and Na+ concentration time series centered around DO events, application of the MCMC-based Bayesian ramp-fitting method developed by Erhardt et al. (2019) yields posterior probability densities ρ_t0(t0) for the transition onset time t0. The approach is based on modeling the transition present in the data as a linear ramp between two constant levels (see Fig. 2 a and b). Tab. 1 lists the 24 DO transitions for which data is available, with bold font indicating the 16 events for which ρ_t0(t0) was successfully derived for both proxies simultaneously. The probability densities for the transition onsets of both proxies during individual events allow us to compute corresponding probability densities for the lag between them: in the following, ρ_i(∆t_i) denotes the probability distribution for the lag of the i-th event under study. The index skips those events for which the transition detection method failed. Fig. 2 shows the transitions to GI-12c for Ca2+ and Na+, together with the derived probability distributions for the transition onsets (panels a and b) and the corresponding distribution of lags between the two (panel c). An overview of all probability density functions ρ_i(∆t_i) for the transition onset lags of the 16 DO events is given in Fig. 4. Tab. 1 lists the probability for each transition to be led by Ca2+. Importantly, there is a clear tendency for Ca2+ to lead the transition for many of the events, as already reported by Erhardt et al. (2019). The sample of lags ∆t = (∆t_1, ∆t_2, ..., ∆t_16) constitutes the data basis for this study. Due to its stochastic character, we refer to the sample as an 'uncertain sample'. In the statistical analysis all of these events are treated equally, even though some transitions might be more pronounced than others and the quality of the fit differs between the events.
The ramp-fitting method does of course not return continuous probability densities but rather empirical densities based on a representative set of 6000 values {∆t_{i,j}^emp} sampled by the MCMC algorithm for each event. In general, empirical densities approximate their continuous counterparts such that

∫ f(∆t_i) ρ_i^emp(∆t_i) d∆t_i = (1/m) Σ_{j=1}^m f(∆t_{i,j}^emp) ≈ ∫ f(∆t_i) ρ_i(∆t_i) d∆t_i

for large enough m and any well-behaved function f. The densities ρ_i(∆t_i) are derived from the empirical densities by means of a Gaussian kernel density estimator.
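The step from an empirical MCMC sample to a continuous density via a Gaussian kernel density estimator can be sketched as follows. The lag values below are hypothetical stand-ins drawn from an assumed normal distribution; the real representative sets come from the Bayesian ramp fit.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical stand-in for one event's MCMC output: m = 6000 sampled
# transition-onset lags (years), drawn from an assumed normal purely
# for illustration.
m = 6000
dt_emp = rng.normal(loc=-8.0, scale=5.0, size=m)

# Gaussian kernel density estimate of the continuous density rho_i(dt)
rho = gaussian_kde(dt_emp)

# The estimate integrates to ~1 and reproduces sample expectations
grid = np.linspace(-40.0, 25.0, 2001)
dx = grid[1] - grid[0]
total = np.sum(rho(grid)) * dx
mean_kde = np.sum(grid * rho(grid)) * dx
print(total, mean_kde, dt_emp.mean())
```

The KDE bandwidth is chosen automatically (Scott's rule) here; any well-behaved expectation computed under the KDE agrees with the raw sample average up to the smoothing scale.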
We consider the empirical densities generated with the Bayesian MCMC-based ramp-fit algorithm the most convenient way to capture the uncertainty in the detection of an abrupt transition onset. Further details of this method can be found in Erhardt et al. (2019).
In view of the different signal-to-noise levels in the calcium and sodium records, we carried out a performance test by applying the algorithm to synthetically generated time series of abrupt transitions perturbed by Gaussian noise of different amplitude.
As expected, the larger the noise amplitude, the less precise the transition onset detection. Importantly, however, there is no systematic bias of the detected transition onset in either direction. For details, see Appendix A.
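The logic of such a performance test can be sketched with a simplified estimator: a synthetic linear ramp is perturbed by Gaussian noise and the onset is recovered by least squares over candidate onsets. This is a toy substitute for the full MCMC ramp fit; the ramp duration, amplitudes, and grids below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ramp(t, t0, dur):
    """Unit ramp: 0 before t0, linear rise to 1 until t0 + dur."""
    return np.clip((t - t0) / dur, 0.0, 1.0)

def fit_onset(series, t, dur, candidates):
    """For each candidate onset, fit offset and amplitude by linear
    regression and keep the onset with the smallest squared error."""
    best_t0, best_sse = candidates[0], np.inf
    for t0 in candidates:
        A = np.column_stack([np.ones_like(t), ramp(t, t0, dur)])
        coef, *_ = np.linalg.lstsq(A, series, rcond=None)
        sse = np.sum((series - A @ coef) ** 2)
        if sse < best_sse:
            best_sse, best_t0 = sse, t0
    return best_t0

t = np.arange(-100.0, 100.0, 2.0)        # 2-year resolution
true_t0, dur = 0.0, 30.0                 # assumed onset and ramp duration
candidates = np.arange(-40.0, 40.0, 2.0)

bias, spread = {}, {}
for sigma in (0.1, 0.5):
    est = [fit_onset(ramp(t, true_t0, dur) + rng.normal(0.0, sigma, t.size),
                     t, dur, candidates)
           for _ in range(200)]
    bias[sigma], spread[sigma] = np.mean(est), np.std(est)

# Larger noise widens the onset distribution but does not shift its center
print(bias, spread)
```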

Statistical setting
Despite their diversity in terms of temperature amplitude, duration, and frequency across the last glacial, the recurring patterns and their common manifestation in different proxies suggest that the DO events follow a common physical mechanism.
Provided this is the case, the relative phasing between Ca2+ and Na+ transitions still involves randomness due to climate variability. From a frequentist perspective, each DO warming is drawn from a hypothesized, infinitely large population of recurring DO events, and the realization of a Ca2+-Na+ lag can thus be regarded as a random experiment (Ω, F, P_∆t) on the sample space Ω = R. Here, F is a σ-algebra defined on Ω and may be taken as the Borel algebra, with P_∆t denoting a probability measure with respect to F that is called the population. This said, the reasoning behind our statistical analysis is as follows: if the Ca2+ transition were to trigger the Na+ transition, the population P_∆t would not ascribe any probability to lags ∆t > 0. In particular, the population mean µ_∆t would necessarily be less than zero. If we fail to exclude a population mean µ_∆t ≥ 0 based on the sample ∆t, it cannot be guaranteed that this necessary criterion is fulfilled and, therefore, the data cannot serve as evidence for a Ca2+ transition lead, let alone a causality between the two transitions. Thus, we focus on deriving qualified statistical statements on the population mean µ_∆t in view of the sample ∆t.
A population P is by itself not observable, but random samples X_n = (X_1, X_2, ..., X_n) of independently and identically distributed random variables may be generated from P, where the randomness of each component of X_n is determined by the population's probability density ρ_P, i.e., X_i ∼ ρ_P for all i. Inferring properties of P from a finite-size sample X_n (which we call an n-sample in the following) is a well-studied problem in statistics, and a broad range of methods is available. The sample of lags ∆t = (∆t_1, ..., ∆t_16) introduced above can be regarded as such a sample generated from P_∆t. However, this description does not account for the uncertainty inherent to each individual sample member ∆t_i. Whenever a random sample is realized on a continuous sample space, its measurement necessarily involves uncertainty and adds a second level of randomness. In our case, this is due to the uncertainty quantified in the MCMC-based transition detection. Instead of a sample x_n = (x_1, x_2, ..., x_n) comprised of scalar entries, one obtains what we call an uncertain sample Y_n = (Y_1, Y_2, ..., Y_n) of random variables Y_i. The generation of a sample from a population and the corresponding two-fold randomness is illustrated in Fig. 3. We emphasize that the randomness of Y_n is not the same as the randomness of the stochastic process under study, but instead originates from the imprecise measurement process. In this section we use capital letters to denote random variables and lower-case letters to denote realizations. A realized sample x_n free of measurement uncertainty will be referred to as a certain sample in the following to avoid confusion. The components of Y_n are independently, but not necessarily identically, distributed.
Therefore, all sample members require an individual treatment. The sample ∆t falls into the class of uncertain samples, with each component ∆t_i being a realization of the same random process, yet carrying its individual measurement uncertainty represented by ρ_i(∆t_i). Surprisingly, there seems to be a lack of literature on how to treat these uncertainties when inferring the properties of P, as soon as the uncertainty of each sample member has its individual shape and hierarchical distributional models cannot be invoked. An obvious approach is to work with the expected values ⟨Y_i⟩ = ∫ y_i ρ_i(y_i) dy_i to obtain a scalar-valued sample, which allows for immediate application of the typical statistical toolkit. However, we will show in the following that this approach is too simplistic and important information may be hidden by the averaging process. Before we discuss how the uncertainties may be treated and how inference on the population from uncertain samples can be realized, we need to clarify how the high-dimensional probability distributions can be treated numerically.
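The two levels of randomness can be illustrated with a toy simulation: first a certain sample is drawn from an assumed population, then each member is blurred by its own measurement uncertainty. All distributions and parameters below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 16, 6000

# Level 1: a certain n-sample drawn from a hypothetical population P
# (normal with mean -5 and standard deviation 10, purely illustrative)
x = rng.normal(-5.0, 10.0, size=n)

# Level 2: each member is observed only through its own measurement
# uncertainty; the widths differ per member, so the Y_i are independent
# but not identically distributed, ruling out a simple hierarchical model.
widths = rng.uniform(2.0, 8.0, size=n)
Y_emp = np.stack([rng.normal(x[i], widths[i], size=m) for i in range(n)])

# The expectations of the uncertain members recover x only up to Monte
# Carlo error; the individual shapes carry the rest of the information.
print(np.max(np.abs(Y_emp.mean(axis=1) - x)))
```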

Numerical treatment of high dimensional probability densities
Given empirical densities ρ_i^emp(y_i) = (1/m) Σ_{j=1}^m δ(y_i − y_{i,j}^emp) of the members Y_i of an uncertain n-sample, based on representative sets {y_{i,j}^emp} of size m, the empirical density in R^n formally reads

ρ_Y^emp(y) = Π_{i=1}^n ρ_i^emp(y_i).

However, this results in a sum with m^n terms, or in a vectorial representation of ρ_Y^emp, a representative set {y_k^emp} with m^n members. Since in our case m = 6000 and n = 16, m^n exceeds the computational memory by far. Therefore, we reduced the size of the representative set from 6000^16 to 6000 vectors by setting

y_j^emp = (y_{1,j}^emp, y_{2,j}^emp, ..., y_{n,j}^emp), 1 ≤ j ≤ m.

To check that Eq. 3 is not violated, we have carried out our analysis with a control group of 10 alternative realizations of ρ_Y^emp(y). The representative sets for the control group densities have been realized by randomly selecting 6000 vectors {y_1, ..., y_6000} from the total possible 6000^16 vectors. Since all final results reported in the main text of this article fall inside the two-standard-deviation uncertainty environment of the control group, we consider our chosen approximation of ρ_Y(y) to be justified. The results for the control group are presented in Appendix C.
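The reduction from m^n to m representative vectors, together with a control group of random re-pairings, can be sketched as follows. The per-event draws are hypothetical normal samples, not the actual NGRIP output.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 16, 6000

# Hypothetical per-event MCMC draws (rows: events, columns: draws);
# each event gets its own random center, purely for illustration.
samples = rng.normal(rng.normal(-5.0, 8.0, size=(n, 1)), 5.0, size=(n, m))

# The full product density would need m**n representative vectors, far
# beyond memory. Reduction: pair the j-th draw of every event into one
# n-vector y_j, keeping only m vectors.
y = samples.T                                # shape (m, n)

# Control group: random re-pairings of the draws across events, used to
# check that the chosen pairing does not bias the derived statistics.
control_means = []
for _ in range(10):
    idx = rng.integers(0, m, size=(n, m))
    y_ctrl = np.take_along_axis(samples, idx, axis=1).T
    control_means.append(y_ctrl.mean())      # e.g. overall mean

print(y.shape, float(np.std(control_means)))
```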
3.4 Inference of the population mean from an uncertain sample

As explained above, we are interested in the mean µ_∆t of the population P_∆t that presumably has generated the given sample ∆t. In this section, we derive a way to directly infer a probability distribution for a population mean from an uncertain sample. First, consider an n-sample x_n generated from a population P with the probability density ρ_P(x). Without any further knowledge, the mean u = (1/n) Σ_{i=1}^n x_i of the sample is the best point estimate for the mean of its generating population. Let

U(Y) = (1/n) Σ_{i=1}^n Y_i

denote the mean of a corresponding uncertain sample. Being a function of random variables makes U(Y) a random variable itself, with a probability density function given by

ρ_U(u) = ∫ δ(u − U(y)) ρ_Y(y) dy.

In the above equation, the δ-function selects the probability weight allocated to those y that are mapped to u. If ρ_P is justifiably assumed or known to be normal, the estimation of the population mean can be refined one step further by drawing on the t-distribution. First, consider an n-sample X_n with sample mean U = (1/n) Σ_{i=1}^n X_i and sample standard deviation

S = ( (1/(n−1)) Σ_{i=1}^n (X_i − U)^2 )^{1/2}

to be realized from a normal population N(µ, σ) with population mean µ and population standard deviation σ. The quantity Z = (U − µ)/(S/√n) is then distributed according to the t-distribution with n − 1 degrees of freedom (the proof is originally due to Student (1908) and can for example be found in Rice (2007)):

P(a ≤ Z ≤ b) = ∫_a^b t_{n−1}(z) dz,

irrespective of the precise values of µ and σ. Here, P(A) is an informal, though widely used, way to denote the probability for the outcome A of a random experiment without previous definition of the sample space and the σ-algebra. For a given pair (u, s) derived from a sample x_n, we can therefore interpret t_{n−1}(z), with z = (u − µ)√n/s, as the probability density that the n-sample originates from a normal population with mean µ, and

P(u − b s/√n ≤ µ ≤ u − a s/√n) = ∫_a^b t_{n−1}(z) dz.

The integrand on the right constitutes a probability density function for µ for any given (u, s). Since a and b can be freely chosen, one can replace u − a s/√n and u − b s/√n with β and α, respectively. Combining these results we obtain

ρ_µ(µ | u, s) = (√n / s) t_{n−1}( (u − µ) √n / s ).   (13)

Eq. (13) relies on an uncertainty-free tuple (u, s) and is in this form not suited to capture uncertainties in the sample. Just as the mean U(Y), the standard deviation S(Y) of a random vector Y is a random variable itself. In general, for a given probability density ρ_Y(y), the statistics U and S are not independently distributed, and their joint probability density reads

ρ_(U,S)(u, s) = ∫ δ(u − U(y)) δ(s − S(y)) ρ_Y(y) dy.

Based on ρ_(U,S), an expected distribution ρ_µ(µ) can be computed, such that any pair (u, s) contributes to ρ_µ(µ) according to its probability density:

ρ_µ(µ) = ∫∫ ρ_µ(µ | u, s) ρ_(U,S)(u, s) du ds.   (15)

Eq. (15) takes account of all uncertainties inherent to Y. Replacing the continuous density with the empirical density defined in Eq. 7 leads to

ρ_µ(µ) ≈ (1/m) Σ_{j=1}^m (√n / s_j) t_{n−1}( (u_j − µ) √n / s_j ), with u_j = U(y_j^emp) and s_j = S(y_j^emp).
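The resulting t-distribution-based estimate of the population mean density, averaged over all representative vectors, can be sketched numerically. The uncertain sample below consists of hypothetical Gaussian draws standing in for the MCMC output.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(4)
n, m = 16, 2000

# Hypothetical uncertain sample: m representative n-vectors y_j standing
# in for the MCMC draws (NOT the actual NGRIP lag data)
y = rng.normal(-5.0, 8.0, size=(m, n))

u = y.mean(axis=1)          # sample means u_j = U(y_j)
s = y.std(axis=1, ddof=1)   # sample standard deviations s_j = S(y_j)

def rho_mu(mu):
    """Density of the population mean: average of the t-based densities
    rho(mu | u_j, s_j) over all representative vectors."""
    mu = np.asarray(mu, dtype=float)
    z = (u[None, :] - mu[:, None]) * np.sqrt(n) / s[None, :]
    return np.mean(t_dist.pdf(z, df=n - 1) * np.sqrt(n) / s[None, :], axis=1)

# The mixture integrates to ~1 over a sufficiently wide grid
grid = np.linspace(-25.0, 15.0, 801)
dx = grid[1] - grid[0]
total = np.sum(rho_mu(grid)) * dx
print(total)
```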

Bootstrapping the distribution of the population mean
Given a certain n-sample x_n, bootstrapping constitutes a complementary approach to derive a distribution for the population mean µ (e.g., Lehmann and Romano, 2006). The great advantage of the non-parametric bootstrapping method is that it does not rely on any assumptions on P (such as normality). Generally speaking, the method assesses the uncertainty of statistics such as the sample mean computed from a certain sample. Here, we make use of the fact that the sample mean is the best point estimate of the population mean, and hence an uncertainty distribution of the sample mean can equivalently be interpreted as an uncertain estimate for the population mean. In short, bootstrapping works as follows: a set of κ synthetic samples {x_k^bs | 1 ≤ k ≤ κ} is generated from the original sample x_n by drawing n values with replacement from x_n. Next, the mean u_k^bs = (1/n) Σ_{i=1}^n x_{k,i}^bs of each synthetic sample is calculated. The set of the κ means u_k^bs generates an empirical bootstrapped probability density for the population mean:

ρ_µ^bs(µ) = (1/κ) Σ_{k=1}^κ δ(µ − u_k^bs).   (18)

Given an uncertain sample Y represented by the empirical density ρ_Y^emp(y) = (1/m) Σ_{j=1}^m δ(y − y_j^emp) instead of a certain sample, the bootstrapping technique can be applied to all y_j^emp comprised in the empirical density, giving rise to m sets of bootstrapped means {u_{j,k}^bs | 1 ≤ k ≤ κ}, each inducing a distribution ρ_{µ,j}^bs(µ) according to Eq. 18. Averaging over these yields the general expression for the population mean distribution

ρ_µ^bs(µ) = (1/(mκ)) Σ_{j=1}^m Σ_{k=1}^κ δ(µ − u_{j,k}^bs),

where u_{j,k}^bs denotes the k-th bootstrapped mean from the j-th member y_j^emp of the empirical density's representative set.
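This pooled bootstrap over all representative vectors can be sketched as follows, again with hypothetical Gaussian draws in place of the MCMC output.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, kappa = 16, 500, 200

# Hypothetical uncertain sample: m representative n-vectors y_j (these
# stand in for the MCMC draws; they are NOT the actual NGRIP lag data)
y = rng.normal(-5.0, 8.0, size=(m, n))

# For every representative vector, draw kappa bootstrap resamples and
# record their means; pooling over j averages the per-vector bootstrap
# densities into one distribution for the population mean.
boot_means = np.empty((m, kappa))
for j in range(m):
    idx = rng.integers(0, n, size=(kappa, n))  # resampling with replacement
    boot_means[j] = y[j][idx].mean(axis=1)

pooled = boot_means.ravel()
print(pooled.mean(), np.percentile(pooled, [5, 95]))
```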

Hypothesis testing with uncertain samples
In order to identify a causal relationship between the transitions in the atmospheric circulation, as indicated by Ca2+, and the transitions in the sea-ice extent, represented by Na+, we must be able to discriminate stochastic from systematic features in the data. In our case, this translates into ruling out that the given sample of time lags between the two proxy variables was generated from a population with mean equal to zero, corresponding to a simultaneous transition. In view of the apparent tendency of Ca2+ to lead the transition across most of the analyzed DO events (see Fig. 4 and previous work by Erhardt et al. (2019)), one-sided hypothesis tests addressing the population mean constitute a convenient method to provide evidence for a population mean less than zero and therefore a systematic time lag. Conversely, not being able to preclude population means equal to or greater than zero implies that causality cannot be evidenced in the data sample, though it does not prove the absence of causality between the two proxy variables.

Numerous statistical hypothesis tests have been designed to systematically rule out distributions that are unlikely to have generated a given sample. As noted above, the classic literature deals with certain samples, while we discuss the application of three different tests, all targeting the population mean, to an uncertain sample of the previously described kind. First, we present our approach to applying hypothesis tests to uncertain data samples in general. In short, we propose a propagation of the uncertainties to the p-value and introduce two potential decision criteria to map the resulting p-value distribution ρ_p(p(Y)) to a binary decision between rejection and acceptance of the null hypothesis. Subsequently, we discuss our choice of tests. Generally speaking, we wish to test how the populations of the transition onsets in Ca2+ and Na+ compare and, in particular, whether the transition onset of Na+ lags the onset of Ca2+. Since we are given only relative data, that is, the difference ∆t in the transition onset times of the two proxies, we must rely on tests that can be applied to what are called paired samples or paired comparisons. This said, we find the paired t-test, the Wilcoxon signed-rank (WSR) test, and a bootstrap test suited for our purpose.

Testing uncertain data
Typically, a statistical hypothesis test draws on a test statistic φ(X): φ : R^n → R; X ↦ φ(X), whose distribution ρ_H0(φ) is known under the so-called null hypothesis H0. The null hypothesis constitutes a set of assumptions on the population P. Given a statistic of a certain sample realization φ(x_0) = φ_0, a p-value is calculated which indicates the cumulative probability of obtaining a more extreme φ_1(X_1) from a second sample X_1, provided H0 is true; for a one-sided (left-tailed) test,

p(φ_0) = ∫_{−∞}^{φ_0} ρ_H0(φ) dφ.

The hypothesis H0 is then rejected at a significance level α if the p-value is below the predefined significance threshold α.
In that case one also speaks of a statistically significant result. For a comprehensive description of hypothesis testing see for example Lehmann and Romano (2006). While the application of hypothesis tests to a certain sample x is in general straightforward, for an uncertain sample it is not. A distribution of sample values ρ_Y(y) first translates into a distribution for the test statistic ρ_φ(φ) and from there into a distribution of p-values ρ_p(p), as depicted in Fig. 3, with ρ_p(p) featuring probabilities for both p < α and p > α. One might be tempted to boil down the uncertain sample to its expected value ⟨Y⟩ before the application of the test to obtain a scalar p-value. In general, however, the p-value of the sample's expectation does not equal the expected p-value of the uncertain sample, due to the non-linearities in both the statistic φ and in the mapping of φ to a p-value. In other words, the significance of a sample may change if its uncertainties are propagated into the space of the test statistic or the p-value. So what is a convenient and sound way to project a distribution of p-values derived from an uncertain sample Y to a scalar in order to take a meaningful decision on the significance by comparing it to α?
We argue that the uncertainties should be propagated to the p-values before any projection to a scalar is applied. The reason for this is that a distribution ρ_Y(y) might assign finite probabilities to y-values that strongly differ from the expectation. Such values do not find any consideration if a test is applied to the expectation of Y. Consider the case of a bimodal distribution of Y with both bulks of the distribution lying in the rejection region of the hypothesis H0. The expected ⟨Y⟩ might yield a non-significant test result, as it is located somewhere between the two bulks, while a propagation of the uncertainty would potentially indicate the significance of the uncertain sample. Given a distribution of p-values ρ_p(p) that incorporates the original uncertainties, the following two criteria can be invoked to decide between acceptance and rejection of the hypothesis:

- The hypothesis shall be rejected at the significance level α if and only if the expected p-value is less than α, that is, ⟨p⟩ = ∫_0^1 p ρ_p(p) dp < α.

- The hypothesis shall be rejected at the significance level α if and only if the probability for p to be less than α is greater than a predefined threshold η (we propose η = 90%), that is, P(p < α) = ∫_0^α ρ_p(p) dp > η.
As the expected p-value constitutes a convenient measure of the overall extremeness of an uncertain sample Y with respect to H0, its comparison to the significance threshold is meaningful. However, in some situations one might want to guarantee that the probability to achieve a significant test result is in fact high, e.g. when mistakenly attested significance is associated with high costs. In these cases the second criterion is more appropriate, even though it does not imply that the first criterion is fulfilled.
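The two decision criteria can be sketched on a hypothetical p-value distribution. A Beta(1, 40) distribution is assumed purely for illustration; it concentrates near zero but keeps some mass above the threshold, so the two criteria disagree.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical p-value distribution of an uncertain sample: one p-value
# per representative vector y_j, drawn from an assumed Beta(1, 40)
p = rng.beta(1.0, 40.0, size=6000)

alpha, eta = 0.05, 0.90

# Criterion 1: reject H0 iff the expected p-value is below alpha
crit1 = p.mean() < alpha

# Criterion 2: reject H0 iff P(p < alpha) exceeds the threshold eta
crit2 = (p < alpha).mean() > eta

# The criteria need not agree: here the expected p-value is below alpha
# (mean of Beta(1, 40) is 1/41), while P(p < alpha) = 1 - 0.95**40 stays
# below eta = 0.90, so only the first criterion rejects.
print(p.mean(), (p < alpha).mean(), crit1, crit2)
```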
In the following we draw on the empirical density representation of Y in order to rigorously propagate the uncertainties to the space of p-values. A corresponding empirical density for the p-values may be obtained by

ρ_p^emp(p) = (1/m) Σ_{j=1}^m δ(p − p_j), with p_j = p(y_j^emp).

t-test
The t-test can be applied to a sample of differences {d_i} derived from a paired sample {(x_i, y_i)} (Rice, 2007). The hypothesis H0 in this case states that both x_i and y_i were generated from normal populations with the same mean µ_x = µ_y. This implies that the population of differences is normally distributed as well, with mean µ_0 = 0. Note that the standard deviations σ_x and σ_y may differ. For large samples, deviations from normality in the original populations can be tolerated and the hypothesis reduces to the equality of means. With n = 16, our sample constitutes an intermediate case. This is one reason why we use the WSR and bootstrap tests in addition to the t-test.
Recall the definition of z, which constitutes the test statistic for the t-test:

z = (u − µ_0) / (s / √n).

The distribution of z under the null hypothesis is given by the t-distribution t_{n−1}(z) with n − 1 degrees of freedom. Setting µ_0 = 0 in the above equation and calculating the p-value p_z from the left-hand side of the distribution,

p_z = ∫_{−∞}^{z} t_{n−1}(z') dz',

effectively allows us to test the hypothesis of a non-negative population mean; p_z may then be compared to the chosen significance threshold.
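Applying this one-sided t-test to every representative vector of an uncertain sample yields a whole distribution of p-values. The vectors below are hypothetical Gaussian draws, not the actual lag data.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(7)
n, m = 16, 4000

# Hypothetical uncertain sample of lags (m representative n-vectors),
# assumed normal draws standing in for the MCMC output
y = rng.normal(-6.0, 10.0, size=(m, n))

u = y.mean(axis=1)
s = y.std(axis=1, ddof=1)

# One-sided t-test applied to every representative vector: small
# left-tail p-values favor a negative population mean.
z = u / (s / np.sqrt(n))
p = t_dist.cdf(z, df=n - 1)

# Expected p-value and probability of a significant outcome (criteria 1/2)
print(p.mean(), (p < 0.05).mean())
```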

Wilcoxon-signed-rank
Given a set of pairs {(x_i, y_i)}, the Wilcoxon signed-rank test (Wilcoxon, 1945) was originally introduced to test whether x = (x_1, ..., x_n) and y = (y_1, ..., y_n) stem from the same population (Rice, 2007). The test relies on the set of differences d_i = x_i − y_i, and the hypothesis H0 states that the population of differences is symmetrically distributed around zero. We apply the test in a one-sided way, such that we can reject distributions centered around any value greater than zero in case of a significant outcome. The test statistic w(d = (d_1, ..., d_n)) for the Wilcoxon signed-rank test is defined as

w(d) = Σ_{i=1}^n Θ(d_i) R(d_i),

where R(d_i) denotes the rank of |d_i| within the sorted set of the absolute values of differences {|d_i|}. The Heaviside function Θ guarantees that exclusively d_i > 0 are summed. Note that a significant w-value does not preclude an asymmetric distribution with mean zero but higher portions of probability located below zero. The corresponding one-sided p-value p_w is given by the cumulative probability that the null hypothesis assigns to w-values smaller than a given w:

p_w = Σ_{w' ≤ w} ρ_H0(w').
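The per-vector application of the one-sided WSR test to an uncertain sample can be sketched with `scipy.stats.wilcoxon`; the difference vectors below are hypothetical Gaussian draws.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(8)
n, m = 16, 1000

# Hypothetical uncertain sample of paired differences d_i (standing in
# for the Ca-Na transition onset lags; assumed normal draws)
y = rng.normal(-6.0, 10.0, size=(m, n))

# One-sided Wilcoxon signed-rank test per representative vector;
# alternative='less' rejects distributions centered at or above zero
p = np.array([wilcoxon(y[j], alternative='less').pvalue for j in range(m)])

# Expected p-value and probability of a significant outcome
print(p.mean(), (p < 0.05).mean())
```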

Bootstrap test
Given a sample of differences $\mathbf{d} = (d_1, ..., d_n)$, a bootstrap test constitutes a third option to test the compatibility of the sample with the hypothesis that the population of differences features a mean equal to or greater than zero: $H_0 := \{\mu_0 \geq 0\}$. The advantage of the bootstrap test lies in its independence from assumptions regarding the distribution's shape. Guidance for the construction of a bootstrap hypothesis test can be found in Lehmann and Romano (2006) and Hall and Wilson (1991). For our test, we bootstrap from the set of differences $\{d_i - u(\mathbf{d})\}$ shifted by the sample mean $u(\mathbf{d}) = \frac{1}{n}\sum_{i=1}^{n} d_i$. We denote the resulting empirical distribution by
$$\hat{\rho}^{\mathrm{bs}}_{0}(u) = \frac{1}{K}\sum_{k=1}^{K} \delta\!\left(u - \hat{u}^{\mathrm{bs}}_k\right),$$
where the $\hat{u}^{\mathrm{bs}}_k$ are the bootstrapped means of samples drawn from the set of differences shifted by the sample's mean value. This distribution can be regarded as a data-derived empirical distribution for the test statistic $U$ compatible with the null hypothesis. The bootstrap p-value $p^{\mathrm{bs}}$ for the sample $\mathbf{d}$ is then given by the fraction of shifted bootstrap means that are smaller than the mean of the sample $u(\mathbf{d})$:
$$p^{\mathrm{bs}} = \frac{1}{K}\sum_{k=1}^{K} \Theta\!\left(u(\mathbf{d}) - \hat{u}^{\mathrm{bs}}_k\right).$$
In order to generalize this test to the case of an uncertain sample $\mathbf{Y}$, the uncertain quantity $U(\mathbf{Y})$ should be compared to a bootstrapped distribution that is consistent with the null hypothesis and simultaneously captures the uncertainties. The application of the shifted bootstrapping scheme to all samples $\mathbf{y}_j$ comprised in the empirical density $\rho^{\mathrm{emp}}_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{m}\sum_{j=1}^{m}\delta(\mathbf{y} - \mathbf{y}^{\mathrm{emp}}_j)$ yields a convenient empirical distribution of the test statistic compatible with the null hypothesis:
$$\rho^{\mathrm{bs}}_{0}(u) = \frac{1}{m}\sum_{j=1}^{m} \hat{\rho}^{\mathrm{bs}}_{u,j}(u),$$
with $u^{\mathrm{emp}}_j = u(\mathbf{y}_j)$ as before. Subsequently, each $u_j$ shall be compared to this distribution of shifted bootstrap means to compute a corresponding p-value,
$$p^{\mathrm{bs}}_j = p^{\mathrm{bs}}(u_j) = \int_{-\infty}^{u_j} \rho^{\mathrm{bs}}_{0}(u)\, du,$$
giving rise to the empirical p-value distribution $\rho^{\mathrm{emp}}_p(p) = \frac{1}{m}\sum_{j=1}^{m}\delta(p - p^{\mathrm{bs}}_j)$. The test scheme for the certain sample could also be applied to all $\mathbf{y}_j$ individually. However, combining the shifted bootstrapped distributions $\hat{\rho}^{\mathrm{bs}}_{u,j}(u)$ into one overarching null-hypothesis distribution for the test statistic $\rho^{\mathrm{bs}}_{0}(u)$ constitutes the more conservative approach for uncertainty propagation.
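For a certain sample, the shifted bootstrap scheme can be sketched in a few lines; the sample values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Ca2+ - Na+ onset lags (years); negative values mean a Ca2+ lead.
d = np.array([-12.0, -8.5, 3.2, -15.1, -6.7, 1.4, -9.8, -4.2,
              -11.3, 2.8, -7.6, -13.9, -5.1, -2.4, -10.2, 4.6])
n, K = d.size, 20000

# Shift the differences so their mean is zero, i.e. consistent with H0,
# then draw K bootstrap resamples and record their means.
d0 = d - d.mean()
boot_means = rng.choice(d0, size=(K, n), replace=True).mean(axis=1)

# p-value: fraction of H0-consistent bootstrap means at or below the observed mean.
p_bs = np.mean(boot_means <= d.mean())
```

Shifting before resampling is the essential step: it ensures the bootstrap distribution of the mean is centered at the null value rather than at the observed sample mean.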
The three tests are applied in combination in order to compensate for their individual deficits. If the population $P_{\Delta t}$ were truly Gaussian, the t-test would be the most powerful test, i.e., its rejection region would be the largest across all tests on the population mean (Lehmann and Romano, 2006). Since normality of $P_{\Delta t}$ cannot be guaranteed, the less powerful Wilcoxon signed-rank test constitutes a meaningful supplement to the t-test, relying on the somewhat weaker assumption that $P_{\Delta t}$ is symmetric around zero. Finally, the bootstrap test is non-parametric and, in view of its independence from any distributional assumptions, adds a valuable contribution.

Results
The statistical methods formulated in Sec. 3 provide a sound framework for the statistical analysis of the data described in Sec. 2. From application of the Bayesian MCMC ramp-fit method introduced by Erhardt et al. (2019), we obtained the empirical densities $\rho^{\mathrm{emp}}_i(\Delta t_i) = \frac{1}{6000}\sum_{j=1}^{6000}\delta(\Delta t_i - \Delta t^{\mathrm{emp}}_{i,j})$ for the uncertain transition onset lag between Ca$^{2+}$ and Na$^{+}$ of the 16 DO events indicated in Tab. 1. Each empirical density is based on a representative set of size $m = 6000$. Additionally, continuous densities $\rho_i(\Delta t_i)$ were generated with a Gaussian kernel density estimator. We derived a 16-dimensional joint empirical density $\rho_{\Delta t}(\Delta \mathbf{t} = (\Delta t_1, ..., \Delta t_{16})) = \frac{1}{6000}\sum_{j=1}^{6000}\delta(\Delta \mathbf{t} - \Delta \mathbf{t}^{\mathrm{emp}}_j)$ according to Sec. 3.3, drawing on a representative set of 6000 vectors $\Delta \mathbf{t}^{\mathrm{emp}}_j$. Hence, the sample vector $\Delta \mathbf{t}$ can be regarded as an uncertain sample as introduced in Sec. 3.2, generated from a population $P_{\Delta t}$ with mean $\mu_{\Delta t}$, which is assumed to describe a hypothesized infinitely large pool of DO onset lags. The correspondence between $\Delta \mathbf{t}$ and the generic uncertain sample $\mathbf{Y} = (Y_1, ..., Y_n)$ allows for the immediate application of the statistical framework developed in Sec. 3.
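The construction of the joint uncertain sample from per-event posterior samples can be sketched as follows; Gaussian stand-ins with hypothetical centers and widths replace the actual MCMC output of the ramp-fit method:

```python
import numpy as np

rng = np.random.default_rng(2)
n_events, m = 16, 6000

# Stand-ins for the MCMC posteriors of the per-event onset lags: Gaussian
# posteriors with hypothetical centers and widths (years).
post_means = rng.uniform(-15.0, 5.0, size=n_events)
post_stds = rng.uniform(2.0, 10.0, size=n_events)

# Each row is one realization of the 16-dimensional uncertain sample,
# i.e. one vector Delta-t_j of the joint empirical density.
lag_samples = post_means + post_stds * rng.standard_normal((m, n_events))

# Averaging each realization yields the empirical distribution of the
# stochastic sample mean U.
u_samples = lag_samples.mean(axis=1)
```

Keeping only $m$ joint vectors instead of all $m^{16}$ combinations is exactly the reduction discussed in Sec. 3.3 and cross-checked in Appendix C.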

Distribution of the population mean
With $U_{\Delta t} = \frac{1}{16}\sum_{i=1}^{16}\Delta t_i$ denoting the mean of the uncertain sample $\Delta \mathbf{t}$ and $\mu_{\Delta t}$ denoting the mean of the population $P_{\Delta t}$, we derived distributions for the stochastic sample mean $\rho_{U_{\Delta t}}(u_{\Delta t})$ and the population mean $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ according to Eq. 9 and Eq. 17, respectively. The distributions are displayed in Fig. 4 together with the bootstrapped distribution $\rho^{\mathrm{bs}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ according to Eq. 19, which reproduces the population mean distribution accurately. $\rho_{U_{\Delta t}}$ appears to be a symmetrically squeezed version of the other two distributions. The expectations of these distributions are listed in Tab. 2 together with the 5th, 50th and 95th percentiles and the probabilities for $U_{\Delta t}$ and $\mu_{\Delta t}$ to exceed zero. All three distributions exhibit almost the same mean value $\langle U_{\Delta t} \rangle \approx \langle \mu_{\Delta t} \rangle \approx -5.6$ yr and a high degree of symmetry around this value.

Figure 4. Distribution of the stochastic sample mean derived from the individual lag distributions $\rho_i(\Delta t_i)$ (see inset) according to Eq. 9, together with the distribution for the population mean $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ (blue) computed according to Eq. 15 and the bootstrapped distribution of the population mean $\rho^{\mathrm{bs}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ (orange) according to Eq. 18. Clearly, the sample mean as well as the population mean are likely to be less than zero. The inset displays the probability densities $\rho_i(\Delta t_i)$ of the transition onset lags for all DO events under study, as listed in Tab. 1, with the index increasing along the rows. Most of the individual events feature elevated probabilities for a negative lag.

Table 2. Statistical features of the distribution $\rho_{U_{\Delta t}}(u_{\Delta t})$ for the sample mean of the uncertain sample $\Delta \mathbf{t}$, the population mean distribution $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ derived by means of the t-distribution, and the bootstrapped distribution for the population mean $\rho^{\mathrm{bs}}_{\mu_{\Delta t}}(\mu_{\Delta t})$. $q_5$, $q_{50}$ and $q_{95}$ denote the 5th, 50th and 95th percentiles of the distributions. The columns $\langle \cdot \rangle$ and $P(\cdot < 0)$ denote the expectation and the probability to be less than zero of the corresponding stochastic variable.
The equality of their expectations is consistent with the fact that the sample mean $U_{\Delta t}$ is the best point estimate for the population mean $\mu_{\Delta t}$. The observation that $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ is much broader than $\rho_{U_{\Delta t}}(u_{\Delta t})$ can be explained by the fact that any sample mean $u_0$ is associated with an interval $(a, b)$ such that a normal distribution $N(\mu_0, \sigma)$ with $\mu_0 \in (a, b)$ can generate $u_0$ with finite probability. A distribution of $U$ consequently must be associated with an even broader distribution of $\mu$. The strong agreement between $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ and $\rho^{\mathrm{bs}}_{\mu_{\Delta t}}(\mu_{\Delta t})$ makes us confident that our approach is justified and robust. The probabilities for $U_{\Delta t}$ and $\mu_{\Delta t}$ to exceed 0 are 13 % and 22 %, respectively.
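For a single, certain sample, the two routes to a population-mean distribution (inverting the t statistic versus bootstrapping sample means) can be sketched as follows; the sample values are hypothetical, and this is only a simplified analogue of the uncertain-sample case treated in the text:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of onset lags (years).
d = np.array([-12.0, -8.5, 3.2, -15.1, -6.7, 1.4, -9.8, -4.2,
              -11.3, 2.8, -7.6, -13.9, -5.1, -2.4, -10.2, 4.6])
n = d.size
u, s = d.mean(), d.std(ddof=1)

# Population-mean distribution obtained by inverting the t statistic:
# mu = u + (s / sqrt(n)) * T with T ~ t_{n-1}, a scaled and shifted t-distribution.
mu_dist = stats.t(df=n - 1, loc=u, scale=s / np.sqrt(n))
p_mu_pos_t = mu_dist.sf(0.0)  # P(mu > 0) under this distribution

# Bootstrapped counterpart: distribution of means of resampled samples.
rng = np.random.default_rng(1)
boot_mu = rng.choice(d, size=(20000, n), replace=True).mean(axis=1)
p_mu_pos_bs = np.mean(boot_mu > 0.0)
```

Both routes should yield similar tail probabilities for a well-behaved sample, mirroring the agreement between $\rho^{\mathrm{emp}}_{\mu_{\Delta t}}$ and $\rho^{\mathrm{bs}}_{\mu_{\Delta t}}$ reported above.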

Significance of the Ca$^{2+}$ lead
In addition to estimating the population mean $\mu_{\Delta t}$, we applied the t-test, the Wilcoxon signed-rank test, and the bootstrap test to the uncertain sample of Ca$^{2+}$-Na$^{+}$ lags $\Delta \mathbf{t}$ in order to eventually rule out a generating population with mean greater than or equal to zero, which would correspond to a Na$^{+}$ transition onset lead or a simultaneous transition, respectively. Tab. 3 summarizes the results obtained by application of the tests under full consideration of the uncertainty, together with the p-values associated with the expected sample of time lags $\langle \Delta \mathbf{t} \rangle = (\langle \Delta t_1 \rangle, ..., \langle \Delta t_{16} \rangle)$ with $\langle \Delta t_i \rangle = \int \Delta t_i\, \rho_i(\Delta t_i)\, d\Delta t_i$. The corresponding expected lags for each DO event are listed in Tab. 1. Strikingly, all tests yield significant results at a significance level of $\alpha = 0.05$ when applied to the expected sample. In contrast, once the uncertainties are rigorously propagated, the tests predominantly yield non-significant results: in view of the data we cannot rule out that the observed tendency for Ca$^{2+}$ to lead the transition is a purely stochastic feature and that the population of lags is actually centered around zero or even a value greater than zero. The histograms associated with the empirical p-value distributions are displayed in Fig. 5. We find broad ranges of p-values spanning from 0 to 1 for all three tests, with elevated frequencies below the significance threshold of $\alpha = 0.05$ that decay towards higher p-values. The probability for a significant result is highest for the WSR test (32 %) and equally low for the t-test and the bootstrap test (26 %).

Table 3. Results from the application of the t-test, the WSR test and a bootstrap (BS) test to the uncertain sample of DO transition onset lags $\Delta \mathbf{t}$. $\langle p \rangle$ denotes the expected p-value, derived from the uncertainty-propagated p-value distribution. The probability for a significant test result associated with the same distribution is indicated by $P(p < 0.05)$.
The discrepancy between the p-values of the expected sample $p(\langle \Delta \mathbf{t} \rangle)$ and the expectation of the uncertainty-propagated p-value $\langle p \rangle = \int p\, \rho^{\mathrm{emp}}_p(p)\, dp$ is due to the non-linearities in both the mapping of $\Delta \mathbf{t}$ to the test statistic and the mapping of the test statistic to the p-value.
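The propagation of sample uncertainty to a p-value distribution can be sketched as follows; the per-event posteriors are Gaussian stand-ins with hypothetical parameters, and the t-test plays the role of any of the three tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_events, m = 16, 6000

# Hypothetical per-event posteriors for the onset lags, as Gaussian stand-ins
# for the MCMC output.
post_means = rng.uniform(-15.0, 5.0, size=n_events)
post_stds = rng.uniform(2.0, 10.0, size=n_events)
lags = post_means + post_stds * rng.standard_normal((m, n_events))

# One-sided t-test applied to every realization of the uncertain sample.
p_vals = stats.ttest_1samp(lags, popmean=0.0, axis=1, alternative='less').pvalue

expected_p = p_vals.mean()              # <p>, the expected p-value
p_significant = np.mean(p_vals < 0.05)  # P(p < 0.05)

# p-value of the expected sample; it generally differs from <p> because both
# the test statistic and the p-value depend non-linearly on the sample.
p_of_expected = stats.ttest_1samp(lags.mean(axis=0), popmean=0.0,
                                  alternative='less').pvalue
```

Comparing `p_of_expected` with `expected_p` makes the non-linearity argument of the text concrete: the p-value of the mean sample is not the mean of the p-values.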

Following Erhardt et al. (2019), we derived an uncertain sample $\Delta \mathbf{t}$ of the Ca$^{2+}$-Na$^{+}$ lags at the DO transition onset from 16 events. There is an apparent tendency for the transition onset in Ca$^{2+}$ to lead the transition in Na$^{+}$. Application of the t-test, the Wilcoxon signed-rank test and a bootstrap test to the expected sample of lags $\langle \Delta \mathbf{t} \rangle$ confirms the impression of a significant lag between the two proxies. However, if the uncertainties from the determination of transition onset timings are rigorously propagated to the corresponding p-values, the picture becomes more ambiguous, with a higher probability for non-significant than for significant results across all employed tests. In fact, the proposed approach of projecting the p-value distributions onto a binary decision fails to reject the hypothesis of a population with zero lag for all tests. Applied in combination, the tests compensate for individual weaknesses and thus confer credibility to the conclusion that a population mean equal to zero, that is, a simultaneous transition of the two proxies, cannot be excluded. The derived shape of the distribution $\rho_{U_{\Delta t}}(u_{\Delta t})$ is in line with the results of the hypothesis tests. The tendency for a calcium lead at the transition onset remains, but with a probability of 13 % the sample features a Na$^{+}$ lead on average. The probability of the sample to stem from a population $P_{\Delta t}$ with mean $\mu_{\Delta t}$ greater than zero was calculated to be 22 %. That said, the analyzed data cannot serve as unambiguous evidence for a significant Ca$^{2+}$ lead at the DO transition onset with respect to Na$^{+}$. The rigorous propagation of uncertainties qualitatively changes the interpretation of the data, and the observed tendency towards a Ca$^{2+}$ lead might simply be a stochastic feature. As already mentioned, we cannot infer an absence of causality from the fact that we cannot identify a systematic lag in the data.

Therefore, in the following we present an important argument that contradicts the idea that either of the two (atmospheric change and sea ice change) is an exclusive trigger of the other.
While the discussion has focused on the population mean $\mu_{\Delta t}$ so far, for the search of a DO trigger mechanism it is important to note that the order of causes and consequences should be preserved across all studied DO events, provided they were driven by the same physics. Hence, if atmospheric circulation changes (indicated by a Ca$^{2+}$ change) were to trigger the North Atlantic sea ice retreat (corresponding to a Na$^{+}$ decrease), the changes in Na$^{+}$ should lag the changes in Ca$^{2+}$ across all analyzed DO events. If instead DO events with both negative and positive lags between the two proxy variables can be evidenced, this would suggest that neither of the two exclusively triggers the other, or that the physical mechanism is not the same across all DO events. Following this thought, we calculated the probability for $n_{\mathrm{Ca}^{2+}}$ out of the 16 events to feature a Ca$^{2+}$ lead at the transition onset, given by
$$P\!\left(N_{\Delta t_i < 0} = n_{\mathrm{Ca}^{2+}}\right) = \sum_{\substack{S \subseteq \{1,...,16\} \\ |S| = n_{\mathrm{Ca}^{2+}}}} \; \prod_{i \in S} P(\Delta t_i < 0) \prod_{j \notin S} \bigl(1 - P(\Delta t_j < 0)\bigr),$$
where $N_{\Delta t_i < 0}$ denotes the number of events with a negative time lag. Since this probability is indifferent to the choice of the events with a Ca$^{2+}$ lead, we have to sum over all constellations with a Ca$^{2+}$ lead in $n_{\mathrm{Ca}^{2+}}$ events. Fig. 6 shows the results for $P(N_{\Delta t_i < 0} = n_{\mathrm{Ca}^{2+}})$ together with the probabilities for the number of events with a Ca$^{2+}$ lead being greater than or equal to $n_{\mathrm{Ca}^{2+}}$, given by the cumulative probability
$$P\!\left(N_{\Delta t_i < 0} \geq n_{\mathrm{Ca}^{2+}}\right) = \sum_{k = n_{\mathrm{Ca}^{2+}}}^{16} P\!\left(N_{\Delta t_i < 0} = k\right).$$
The probability for all (but one) events to be led by calcium is as small as 0.03 % (0.39 %). Only configurations with 8 to 13 transitions being led by Ca$^{2+}$ are associated with probabilities higher than 5 %, with a maximum of 22 % likelihood for the configurations with 10 or 11 events being led by Ca$^{2+}$ and, correspondingly, 6 or 5 events being led by Na$^{+}$. Also, the cumulative probability surpasses 5 % only at $n_{\mathrm{Ca}^{2+}} = 13$, implying that the probability that at least 14 events are led by Ca$^{2+}$ is still below 5 %. For $n_{\mathrm{Ca}^{2+}} = 10$ the cumulative probability exceeds 50 %. Hence, it is more likely than not that at least 10 events exhibit a calcium transition lead. Finally, the probability for a minimum of 7 transitions with a Ca$^{2+}$ lead is already above 95 %. These results corroborate the conclusion that there is a tendency, yet no guarantee, for Ca$^{2+}$ to transition first across the DO events under study. The fact that there is only a 10 % chance that 13 or more transitions are led by Ca$^{2+}$ can be regarded as evidence for the presence of transitions with a Na$^{+}$ lead. Therefore, in terms of a physical interpretation, we can state that this data contradicts the hypothesis that atmospheric changes exclusively trigger the North Atlantic sea ice retreat.
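The count of Ca$^{2+}$-led events follows a Poisson binomial distribution, which can be evaluated without enumerating all constellations by convolving the per-event Bernoulli distributions; the per-event probabilities below are hypothetical stand-ins for the values listed in Tab. 1:

```python
import numpy as np

# Hypothetical per-event probabilities P(Delta t_i < 0) of a Ca2+ lead.
rng = np.random.default_rng(4)
q = rng.uniform(0.3, 0.95, size=16)

# Poisson binomial pmf of the number of Ca2+-led events, built up by
# convolving the per-event Bernoulli distributions one event at a time.
pmf = np.array([1.0])  # pmf[k] = P(N = k), starting with zero events
for qi in q:
    pmf = np.convolve(pmf, [1.0 - qi, qi])

# Cumulative tail probabilities P(N >= n) for n = 0, ..., 16.
p_at_least = np.cumsum(pmf[::-1])[::-1]
```

The convolution has cost $O(16^2)$, whereas the explicit sum over all subsets $S$ would touch $2^{16}$ constellations; both yield identical probabilities.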
The question arises which DO events are the ones that exhibit a Na$^{+}$ lead. However, since we are dealing with stochastic lags, this question cannot be answered conclusively on the basis of the empirical data alone. In Tab. 1 the probability for Na$^{+}$ to lead the transition is listed for all events under study. A first review of the events with high probability does not reveal any systematic pattern, yet this topic merits further investigation.

Conclusions
We regarded the Ca$^{2+}$-Na$^{+}$ transition onset lag distributions $\rho_i(\Delta t_i)$ for 16 DO events, derived from the data set provided by Erhardt et al. (2019), as an uncertain sample generated from a population $P_{\Delta t}$. Within this framework, we have shown that the data does not significantly contradict the hypothesis that the mean $\mu_{\Delta t}$ of a generating population equals or even exceeds zero when taking into account the uncertainties involved in the transition detection. Three different hypothesis tests, all targeting the population mean, consistently yield non-significant results under the proposed decision criterion for application to uncertain samples. In agreement with this result, the derived distribution $\rho_{\mu_{\Delta t}}(\mu_{\Delta t})$ features a probability of 22 % for the population mean to exceed zero under the assumption of normality. Even for the stochastic sample mean $U_{\Delta t}$ we find a probability of 13 % to be larger than zero, as indicated by the distribution $\rho_{U_{\Delta t}}(u_{\Delta t})$. Despite the non-significance, our results confirm the tendency observed by Erhardt et al. (2019) of calcium to lead the transition. However, we must state that this tendency cannot be discriminated from a purely stochastic feature and therefore, provided the physical proxy interpretation holds true, cannot serve as evidence for atmospheric changes to trigger sea ice retreat during DO events. In fact, by calculating the likelihood of lead-lag configurations across the 16 events under study, we find evidence for the data to comprise both transitions led by Ca$^{2+}$ and transitions led by Na$^{+}$. This finding implies that atmospheric changes cannot be viewed as an exclusive trigger mechanism for the North Atlantic sea ice retreat during DO events, because such a causal relationship would require the temporal sequence to be preserved across all events.
In the one-mechanism framework, our results could still be explained by an external trigger represented by neither Ca$^{2+}$ nor Na$^{+}$, which drives both atmospheric changes and sea ice retreat, with the atmosphere simply responding faster. Beyond that, our findings question the assumption that all DO events followed the same physical pattern of causes and consequences. In order to gain more insight into the temporal order of events, we suggest the following: First, studies similar to this one should be carried out targeting the phasing of other proxy variables.

Figure A1. (a) Synthetic linear ramps disturbed by stochastic white noise with different standard deviations $\sigma$, therefore exhibiting different signal-to-noise ratios (SNR) $\frac{\Delta y}{\sigma}$. The black line is the mean over the ensemble of 6000 ramps sampled with the MCMC ramp-fit algorithm, while the gray shaded area indicates the 5th to 95th percentiles of the same ensemble. (b) Mean transition onsets $\langle t_0 \rangle = \int \rho^{\mathrm{emp}}(t_0)\, t_0\, dt_0$ derived from the ensembles sampled by the MCMC ramp-fit algorithm for synthetically generated noisy ramps with different signal-to-noise ratios. The error bars indicate the 5th to 95th percentiles for $t_0$.
Appendix A: Sensitivity of the transition detection

Na$^{+}$ transitions exhibit a lower signal-to-noise ratio than the corresponding Ca$^{2+}$ transitions for many of the events under study.
Therefore, we carried out a performance test for our stochastic transition detection by applying the algorithm to synthetically generated linear ramps disturbed by Gaussian white noise. To investigate the influence of the noise, we kept the ramp amplitude $\Delta y$ constant and increased the standard deviation $\sigma$ of the perturbing noise step-wise. As can be seen in Fig. A1, the detection of the transition onset becomes less precise with decreasing signal-to-noise ratio $\frac{\Delta y}{\sigma}$. That is, the mean of the posterior distribution $\rho^{\mathrm{emp}}_{t_0}(t_0)$ may differ more strongly from the true transition onset for lower signal-to-noise ratios, while simultaneously the standard deviation of $\rho^{\mathrm{emp}}_{t_0}(t_0)$ increases. Importantly, there is no systematic bias of the detected transition onset in either direction. Also, we find that for most of the synthetic time series, the true transition onset lies inside the uncertainty interval of the derived distributions $\rho^{\mathrm{emp}}_{t_0}(t_0)$. Thus, the algorithm is suited for application to data with different signal-to-noise ratios. A lower signal-to-noise ratio is simply reflected in a broader distribution for $t_0$ computed by the algorithm.
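The sensitivity experiment can be sketched as follows; for simplicity, a brute-force least-squares fit with fixed ramp length replaces the Bayesian MCMC ramp fit, and all parameter values are hypothetical, so the numbers only illustrate the qualitative SNR dependence:

```python
import numpy as np

rng = np.random.default_rng(5)

def noisy_ramp(t, t0, t1, dy, sigma):
    """Linear ramp rising from 0 to dy between t0 and t1, plus white noise."""
    signal = np.clip((t - t0) / (t1 - t0), 0.0, 1.0) * dy
    return signal + sigma * rng.standard_normal(t.size)

def fit_onset(t, y, dy, length=20.0):
    """Brute-force least-squares estimate of the onset t0 with fixed ramp length.
    A crude stand-in for the Bayesian MCMC ramp fit of Erhardt et al. (2019)."""
    best_t0, best_cost = t[0], np.inf
    for t0 in t[:-int(length)]:
        model = np.clip((t - t0) / length, 0.0, 1.0) * dy
        cost = np.sum((y - model) ** 2)
        if cost < best_cost:
            best_t0, best_cost = t0, cost
    return best_t0

t = np.arange(0.0, 200.0)
dy, true_t0 = 1.0, 90.0
results = {}
for sigma in (0.1, 0.5, 1.0):  # SNR = dy / sigma, decreasing
    estimates = [fit_onset(t, noisy_ramp(t, true_t0, true_t0 + 20.0, dy, sigma), dy)
                 for _ in range(50)]
    results[sigma] = (np.mean(estimates), np.std(estimates))
    print(f"SNR {dy / sigma:4.1f}: onset {results[sigma][0]:6.1f} "
          f"+/- {results[sigma][1]:.1f}")
```

As in the appendix, the mean estimate stays close to the true onset while the spread of the estimates grows as the signal-to-noise ratio decreases.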

Appendix B: Distribution of the uncertain statistics
In Sec. 3.6 we discuss the propagation of the sample uncertainty to the p-value of any given hypothesis test. Certainly, this relies on an intermediate propagation of the uncertainty to the corresponding test statistic. Fig. B1 shows the density histograms of the empirical distributions of the uncertain test statistics for all three tests; the corresponding null distributions are displayed as well. The histograms are not supposed to coincide with the null distributions. Clearly, for all three tests, the empirical probability densities of the uncertain test statistics comprise both significant and non-significant values of the statistic.
Appendix C: Results of the analysis for the control group

As explained in Sec. 3.3, we drastically reduce the set of vectors considered in the representation of $\rho^{\mathrm{emp}}_{\Delta t}(\Delta \mathbf{t}) = \frac{1}{6000}\sum_{i=1}^{6000}\delta(\Delta \mathbf{t} - \Delta \mathbf{t}^{\mathrm{emp}}_i)$ from the $6000^{16}$ theoretically available vectors to 6000 vectors $\Delta \mathbf{t}_i$. To cross-check the results obtained within the limits of this approximation, we applied our analysis to a control group of 10 alternative realizations of the probability density, $\rho^{\mathrm{emp}}_{\Delta t, j}(\Delta \mathbf{t}) = \frac{1}{6000}\sum_{i=1}^{6000}\delta(\Delta \mathbf{t} - \Delta \mathbf{t}^{\mathrm{emp}}_{i,j})$, all drawing on different representative sets $\{\Delta \mathbf{t}^{\mathrm{emp}}_i\}_j$ with $i \in \{1, 2, ..., 6000\}$ and $j \in \{1, 2, ..., 10\}$. These sets have been generated by drawing 6000 vectors at random from the pool of theoretically available vectors. We find that all original results that rely on the empirical density $\rho^{\mathrm{emp}}_{\Delta t}(\Delta \mathbf{t})$ fall well into the uncertainty range determined by the control group. Tab. C2 summarizes the results derived from the control group members for the population mean, while Tab. C1 shows the results obtained by application of the hypothesis tests to the control group.

Table C1. Results obtained from the application of the hypothesis tests to the control group. Reported are the expected p-values $\langle p \rangle$ together with the p-values of the expected samples $p(\langle \Delta \mathbf{t} \rangle)$ and the probability of the uncertain sample to be significant, $P(p < 0.05)$, for all three tests. All results were derived from the corresponding empirical densities $\rho^{\mathrm{emp}}_{\Delta t, j}(\Delta \mathbf{t})$ (labeled by $r_j$). Also listed are the mean and the standard deviation (std) for each quantity across the control group. For comparison, the results from the original analysis are given as well.

Table C2. Expected values for the population mean derived from the control group probability densities, together with the corresponding probability for the population mean to be less than zero. The left part of the table refers to population mean distributions obtained by inverting the t-distribution, while the right part refers to population mean distributions obtained from bootstrapping sample means from the empirical probability distribution of the uncertain sample. Also listed are the mean values and standard deviations across the control group and, for comparison, the results from the original probability density as reported in the main part of this article.