Testing Hypotheses About Glacial Dynamics and the Stage 11 Paradox Using a Statistical Model of Paleo-Climate

To test hypotheses about glacial dynamics, the Mid-Brunhes event, and the stage 11 paradox, we evaluate 10 the ability of a statistical model to simulate climate during the previous ~800,000 years. Throughout this period, the model simulates the timing and magnitude of glacial cycles, including the saw-tooth pattern in which ice accumulates gradually and ablates rapidly, without nonlinearities or threshold effects. This suggests that nonlinearities and/or threshold effects do not play a critical role in glacial cycles. Furthermore, model accuracy throughout the previous ~800,000 years suggest that changes in glacial cycles associated with the Mid-Brunhes event, which occurs near the 15 division between the out-of-sample period and the in-sample period, are not caused by changes in the dynamics of the climate system. Conversely, poor model performance during MIS stage 11 and Termination V is consistent with arguments that the ‘stage 11 paradox’ represents a mismatch between orbital geometry and climate. Statistical orderings of simulation errors indicate that periods of reduced accuracy start with significant reductions in the model’s ability to simulate carbon dioxide, non-sea-salt sodium, and non-sea-salt calcium. Their importance suggests that the 20 stage 11 paradox is generated by changes in atmospheric and/or oceanic circulation that affect ocean ventilation of carbon dioxide.


3
Values for the RMSE and statistical differences between simulated values and values from the proxy record indicate that the model generally performs well during the in-and out-of-sample period. We interpret this general accuracy to indicate that: 1.
Nonlinear relations,threshold effects, and/phase-specific governing equations do not play a critical role in glacial cycles. 5 2. Glacial cycles are driven by the same dynamics before and after the MBE.

3.
Terminations in general -and the 'stage 11 paradox' in particular -may be caused by changes in atmospheric circulation and/or the extent of sea ice, which affects the ventilation of the deep ocean and ultimately, affects the atmospheric concentration of carbon dioxide.
These results and the methods used to obtain them are described in five sections. Section 2 describes the data and 10 methods used to generate and analyze the simulations. The results are described in section 3. Section 4 interprets the results relative to the three hypotheses described previously, and section 5 concludes.

Methods
The CVAR model described by KJ2013 is simulated in a dynamic simulation (equivalent to a dynamic forecast) 15 conditioned on orbital geometry alone over the 791 thousand years before the present (kyr BP). Simulated values ( " # ) are subtracted from the corresponding values from the proxy record ( # ) to calculate simulation errors # = # − " # .
Simulation errors ( t) are analyzed three ways. First, we compute the root mean square error (RMSE) to evaluate model accuracy over pre-defined periods. Second, simulation errors are analyzed to identify periods when the model fails systematically, either in a single time step (outlier) or during two or more consecutive time steps (persisting 20 errors). Third we examine the statistical ordering among simulation errors (and the explanatory power of simulations that are generated by conditioning the model on endogenous variables) to evaluate competing hypotheses for the 'stage 11 paradox,' which is a significant mismatch between orbital geometry and climate associated with marine isotope stage (MIS) 11, 424 -375 kyr BP (Imbrie et al., 1993).

2.1Model Data
The four series used to represent orbital position, the six series used to represent climate, and the four series used to represent physical and biological mechanisms that link the six climate variables to each other and orbital geometry are the same as those used in KJ2013 (Table 1). KJ2013 uses four series to represent the effect of orbital geometry: conditions, but can be converted to global values by assuming that a scaling factor, which is derived from a limited set of observations can be applied across all observations (Masson-Delmotte et al., 2010;Masson-Delmotte et al., 2006). The *+ data that are used to proxy ice volume, which also includes information about deep water temperature 5 (Chappell and Shackleton., 1986;Shackleton, 2000), are derived from 57 cores drilled by the Deep-Sea Drilling Project and Ocean Drilling Program across the globe (Lisiecki and Raymo, 2005). Sea surface temperature is constructed using alkenones from site PS2489-2/ODP1090 in the sub-Antarctic Atlantic. Data for sea level are reconstructed using oxygen isotope records from Red Sea sediments (Siddall et al., 2003).
These six variables are linked to each other and orbital position via physical and biological mechanisms that are 10 represented by the four proxy variables. Fe is derived almost entirely from terrestrial sources and proxies changes in atmospheric circulation and a so-called iron fertilization effect, which may enhance the biotic uptake of CO2 (Martin, 1990). Sulfate SO4 originates mainly from marine biogenic emissions of dimethylsulphide (after removing sea-salt sources using the Na data), and so proxies marine biological activity (Cosme et al., 2005). It is included to represent the possible effect of iron-containing dust on biological activity and/or the effect of biological activity on atmospheric 15 CO2. Sea salt sodium Na is derived from the sea-ice surface and proxies the extent of winter sea-ice (Wolff et al 2003).
It is included to represent the possible effect of sea ice on the flow of CO2 from the ocean to the atmosphere (Stephens and Keeling, 2000). Non sea-salt calcium Ca has a terrestrial origin (mainly Patagonia) and may represent changes in temperature, moisture, vegetation, wind strength, glacial coverage, or changes in sea level in and around Patagonia (Basile et al., 1997), a locale thought to play an important role in glacial cycles.

20
To make these data amenable to a statistical analysis, we convert them to a common time scale (EDC3) using conversions from Parrenin et al., (2007) and Ruddiman and Raymo (2003). Unevenly spaced observations are interpolated (linearly) to generate a data set in which each series has a time step of 1 kyr (Miller, 2019). To eliminate the effects on inverting matrices with elements that differ greatly in size (due to different units of measurement), each of the fourteen time series is standardized as follows: where is the value (in original units), is the average value over the in-sample period, and ( ) is the variance over the in-sample period.

30
The equations used to estimate the CVAR model in KJ2013 are given by: in which # is a 10 × 1 vector that includes the ten endogenous variables; ∆ is the first difference operator (∆ # = # − #B* ), -is an error term with mean value zero and variance Ω that is normally, independently, and identicially distributed.
The condition that the conditional process ( # | # ) is nonstationary is formulated as a reduced rank hypothesis on the in which is a 10 × matrix of coefficients, which describe the rate at which the ten climate variables adjust back towards equilibrium after the system has been pushed away by exogenous shocks (i.e. changes in orbital geomtery); is the number of cointegration relations given by the reduced rank of the Π matrix; and is a × 15 matrix of cointegration coefficients that define the r stationary deviations from long-run equilibrium relationships, the so called cointegration relations, I # . Maximum likelihood estimates for the elements of the and matrices as reported by 10 KJ 2013 are given in section II of the Supplemental Material. The model in KJ2013 is estimated as a partial system (Johansen 1992, Harbo et al., 1998, Juselius 2006 where orbital variables are weakly exogenous. Here we simulate the estimated model model over the full time period using a dynamic simulation in an open model, conditioned on the (strongly) exogenous orbital variables # (equivalent to a dynamic forecast). To simulate climate during the in-and out-of-sample periods, the ten endogenous variables x are expressed as a function of the exogenous 15 solar variables and shocks to the climate system by inverting Equation (2) into the moving average form: where = X (1 − Γ * ) B* X ; X is a 10 × (10 − ) matrix orthogonal to describing the stochastic trends and X is a 10 × (10 − ) matrix orthogonal to determining how the stochastic trends load into the climate variables; L is the lag operator (for example, # = #B* ); * ( ) and V * ( ) are stationary lag polynomials; V is 10 × 4; and the 20 matrices are functions of the parameters ( ?, *, Γ *, , ). Based on the ten cointergating relations reported by KJ2013 r = 10, then C = 0, the in-and out-of-sample simulations are based on model (2) subject to (3) by setting # = 0 which implies that the simulated variables, " # , are calculated from the exogenous drivers, V # , ( ? ∆ # ), the dynamics attached to them, V * ( )∆ #B* , ( * ∆ #B* ), and the internal climate dynamics * ( ) # (Γ * Δ " #B* , IY Z[\ ).

35
We use RMSE as a simple heuristic to compare the model's predictive accuracy during the in-and out-of-sample periods. Because accuracy may vary over time, we use an indicator saturation technique [R-package gets Pretis et al., 2018;Castle et al., 2015] to identify periods during which the simulation significantly deviates from observations (i.e. simulation errors are statistically different from zero). Outliers refer to a statistically significant difference in the 6 simulated value of variable x relative to the observed value for a single time step, while persisting errors are statistically significant differences that persist for two or more consecutive time-steps. Outliers and persisting errors are evaluated for every possible time step. Here, we retain only those outliers or persisting errors that exceed the pα = 0.001 threshold.
This tightly controls the false-positive rate of detected periods of model failure. The method used to identify outliers and persisting errors are summarized in Supplementary Section III. This approach is used to assess the time-varying 5 performance of climate models (Pretis et al., 2015), the forecast accuracy of economic predictions (Ericsson 2017), as well as to detect volcanic eruptions in temperature reconstructions in both simulated climate data (Pretis et al., 2016) and proxy-reconstructions (Schneider et al., 2017).

10
If model performance does not change over time, we expect outliers and persisting errors to occur randomly throughout the sample and be equally likely in each sub-sample. We use this assumption to compare the distribution of outliers and persisting errors between in-sample and out-of-sample periods and among nineteen marine isotope stages. For each thousand-year time step, we count the number of variables that exhibit an outlier or persisting error.

15
Following this procedure, the maximum number of outliers or persisting errors for any single time-step is ten. These sums (and values for individual variables) are assigned to the in-or out-of-sample period or individual marine isotope stages.
To evaluate the distribution of outliers and persisting errors between the in-and out-of-sample periods and among marine isotope stages, we test whether their occurrence is different from a uniform random distribution (expected 20 under the null-hypothesis of equal performance) using a Pearson chi-square test (P), which is calculated as follows: in which n is the number of periods (n=2; in-sample j = 1; out-of-sample j = 2; or nineteen marine isotope stages), Oj is the number of outliers or persisting errors that are identified in period j, and Ej is the number of occurrences expected in period j.

25
The number of occurrences expected in period j (Ej) is calculated based on the null hypothesis that outliers or persisting errors are distributed uniformly among periods. This null implies that the expected value ( f ) can be calculated as: in which Yr is the number of thousand-year time steps in period j for which observed values are available and n is the number of periods for which observed values are available for the 791 kyr simulation period. P is evaluated against a 30 m distribution with n-1 degrees of freedom. If the test rejects the null hypothesis that outliers or persisting errors are distributed randomly among periods (i.e. some periods are simulated more/less accurately than others), the more accurate subsample is identified by the numerator of Equation (5) ( f − f ). A negative value during the in-sample period (( * − * ) < 0) would indicate that the number of outliers or persisting errors detected during the in-sample period is less than expected by a uniform random distribution. This result would suggest that the model generates a 35 more accurate simulation during the in-sample period. Equations (5) and (6) also are used to test whether outliers or persisting errors are distributed randomly across the nineteen marine isotope stages (n=19) that fall within the 791 kyr simulation. The first observation is 791 kyr BP, which falls in MIS 19.

Causes for Model Failure
To evaluate the cause(s) for model failure, we test whether poor performance 'starts' with a specific variable(s) and whether this failure is communicated to the other variables through long-and short-run relations among endogenous 5 variables. To identify the variable(s) that initiates the poor performance, we formalize techniques that are used by previous analyses. Previous analyses estimate a regression equation that specifies a dependent variable as a function of lagged values for an independent variable thought to 'precede' the dependent variable. For example, Li et al., (1998) conclude that CO2 'precedes' *+ based on regression results that indicate *+ is related to five lagged values of CO2.

10
But this approach is incomplete (from a statistical perspective) because it ignores the autocorrelation structure of the dependent variable. To account for this effect, we use a technique developed by Granger (1969) that is used to analyze relations among climate variables during the instrumental temperature record (e.g. Kaufmann and Stern, 1997;Stern and Kaufmann, 2014). For this application, we estimate the following regression: 15 in which Di,t is an indicator variable that equals one if the simulation error for variable i during period t is statistically different from zero (i.e. is a persisting error) (Di,t = 0 otherwise), is an error term (assumed to be normally distributed), and , , , , are regression coefficients that are estimated using ordinary least squares. The number of lags (s) is determined using the Akaike Information criterion (Akaike, 1973). Equation (7) is estimated ten times, once with the simulation error for each endogenous variable on the left-hand side. We expect the coefficients generally 20 to be statistically different from zero because simulation errors generally are correlated across variables, however, we are interested whether during the periods of simulation failure (as given by Di,t = 1), persisting errors for other endogenous variables propagate through the system, pre-dating/predicting persisting errors in the endogenous climate variable being modelled. Because the level of significance of selection in the first stage (pα ≃0.001) makes falsepositives in Di,t unlikely (approximately 1 outlier to be expected spuriously on average), the detection of breaks in the 25 first stage probably has little effect on tests on Di,t in this second stage. We repeat this process without the simulation errors for sea level because the first observation for sea level (462 kyr BP) is much more recent than the other time series (Table 1), which limits the sample range when all ten simulation errors are analyzed using Equation (7).
For each simulation error for variable i ( ), we estimate Equation (7) ten times. In each, we eliminate the simulation errors for one of the ten endogenous variables interacted with its non-zero mean simulation dummy 30 ∑ -,#Bf -,f -,#Bf u fS* .. This restriction is evaluated using an F-statistic that tests the null hypothesis that the persisting errors for the endogenous variable eliminated from Equation (7) have no information about the dependent variable beyond the additional variables included. These variables include the lagged values of simulation errors, the persisting simulation errors for the other endogenous variables, and the four exogenous variables for orbital geometry. Rejecting this null hypothesis allows us to state that the model's inability to simulate the endogenous variable that is eliminated 35 from Equation (7) (as indicated by persisting errors) precedes the simulation errors for the endogenous climate variable on the left-hand side of Equation (7).

5
For both the in-and out-of-sample periods, Figure 1 suggests that the model generally captures the timing and magnitude of persistent changes in climate that are described by glacial cycles, which frequently are summarized by changes in land ice volume (Ice). For this variable, the model generally simulates the timing and magnitude of glaciations and terminations, including the gradual accumulation of ice and its rapid ablation (i.e. the saw-tooth pattern). Furthermore, there are no skipped obliquity/precession beats (other than MIS 11). Finally, the model's ability 10 to simulate glacial cycles during the out-of-sample period is inconsistent with speculation that the CVAR model's ability to reproduce the ten climate/physical variables during the in-sample period simply reflects the model's ability to reproduce the data used to estimate the coefficients. Instead, the ability of the model to simulate climate during the out-of-sample period suggests that its coefficients capture relations among orbital geometry and the ten climate/physical proxies that govern the climate system beyond the sample period.

In-vs. Out-of-Sample Comparisons
The similarity between the model's accuracy in-and out-of-sample ( Figure 1) is consistent with comparisons of root mean square error ( Figure 2). As expected, the RMSE for the out-of-sample period generally is larger than the RMSE 20 for the in-sample period. But much of this increase is associated with MIS 11, most of which occurs during the outof-sample period ( Figure 1). If we eliminate MIS 11 from the out-of-sample period, the RMSE of the in-and out-ofsample periods are similar ( Figure 2). The outsized effect on the RMSE for the out-of-sample period is consistent with the 'stage 11 paradox.' Tests indicate that we cannot reject the null hypothesis that outliers are distributed randomly between the in-and out-25 of-sample periods (Table 2). A test statistic χ m (1) = 0.09 fails to reject (p > 0.76) the null hypothesis that as a group, outliers for the ten climate/physical variables are distributed randomly between the in-and out-of-sample periods.
Conversely, a test statistic χ m (1) = 52.5 rejects (p < 0.001) the null hypothesis that as a group, persisting errors for the ten climate/physical variables are distributed randomly between the in-and out-of-sample periods.

Comparisons Among Marine Isotope Stages
Outliers and persisting errors are not distributed randomly among the nineteen marine isotope stages ( Figure 3, Table   2). This result is generated in part by the 'stage 11 paradox.' If this stage is eliminated from consideration, we cannot reject the null hypothesis that outliers for variables other than methane are distributed randomly among the remaining 35 eighteen stages. Similarly, the RMSEs across variables are very similar in and out-of-sample when MIS 11 is excluded ( Figure 2). Conversely, the number of persisting errors is not distributed randomly among the nineteen marine isotope stages, even if errors in MIS 11 are excluded (Table 2). https://doi.org/10.5194/cp-2020-58 Preprint. Discussion started: 11 May 2020 c Author(s) 2020. CC BY 4.0 License. 9

Causes for Model Failure
Applying the p = 0.05 threshold to the tests that evaluate restrictions on Equation (7), sixteen of the one hundred tests reject the null hypothesis that lagged values for persisting errors (interacted with the non-zero dummy variable D) have no information about current values for the simulation errors on the left-hand side of Equation (7) beyond the 5 right-hand side variables that remain in Equation 7 (Table 3). For the eighty-one tests run on the nine endogenous variables (other than sea level), the null is rejected eleven times (Table 4). In both cases, the number of rejections observed is greater than the number expected due to repeated testing at p = 0.05, five and four rejections, respectively.
Together, these results suggest that the test results reveal information about the statistical ordering of simulation errors.

Nonlinearties and/or threshold effects drive the timing and magnitude of glacial cycles
A recent review of terminations states "Terminations clearly represent a strongly nonlinear response to regional changes in the seasonality of solar radiation (Past interglacials Working Group of Pages, 2016)." We test this statement 15 by using the CVAR to evaluate hypotheses about the importance of thresholds (e.g. Paillard, 1998;2001;Ganopolski et al., 2016;Tzedakis, et al., 2017), nonlinearities (e.g. Tziperman et al., 2006), or governing equations that vary by phase of the glacial cycle. If any of these play an important role, the CVAR model, which does not include their effects, will not be able to simulate glacial cycles.
The CVAR model is largely linear. Both long-and short-run relations among variables are linear. The only non-linear 20 relation is given by the fractional rate at which variables adjust to disequilibrium in the long-run relations ( ). But this nonlinearity is constrained by the fact that the fractional rate of adjustment is constant and applies during all phases of the glacial cycle.
Despite its largely linear specification, the CVAR generally simulates the timing and magnitude of changes in ice volume (and other variables) without any skipped beats other than stage 11. Furthermore, this linear specification 25 allows the model to simulate the saw-tooth pattern by which ice volume builds slowly but melts rapidly. These results suggest that non-linear relations, thresholds, or changes in governing equations are not important drivers of glacial cycles. This suggestion does not reject their presence, rather, Occam's razor implies that nonlinearities, threshold effects, and/or phase-specific governing equations are not needed to simulate important aspects of glacial cycles.
Furthermore, the CVAR's ability to simulate climate during the out-of-sample period is inconsistent with the 30 hypothesis that "glacial cycles would exist even in the absence of the insolation changes (Tziperman et al., 2006)." If glacial cycles exist independently of changes in orbital geometry, a statistical model that is conditioned only on orbital geometry and spun up with no memory of previous cycles would not be able to simulate glacial cycles accurately during the initial out-of-sample period. As in Gonapolski and Calov (2011), the ten variables come to an equilibrium and do not change thereafter if orbital geometry is held constant. Furthermore, the accuracy of the out-of-sample 35 simulation is inconsistent with the argument that changes in solar insolation account for less than 20 percent of the variance in glacial temperature records (Wunsch, 2004). https://doi.org/10.5194/cp-2020-58 Preprint. Discussion started: 11 May 2020 c Author(s) 2020. CC BY 4.0 License.

The Mid Brunhes Event
The demarcation between the in-and out-of-sample period (391 kyr BP) falls close to the Mid-Brunhes event, MBE (Jansen et al., 1986). Compared to the in-sample period used to estimate KJ2013, the pre-MBE out-of-sample period has; (1) lower concentrations of CO2, (2) glacial cycles with a smaller amplitude, and (3) cooler but longer interglacial 5 periods (EPICA, et al., 2004;Luthi et al., 2008;Hoenisch et al., 2009). These three changes beg the question, do they represent a change in the dynamics that drive glacial cycles and/or a change in the drivers of glacial cycles. The latter is supported by Yin (2013), who concludes, "through a set of internal mechanisms insolation alone induces a systematic difference between the interglacials before and after the 430 kyr ago in some ocean processes that are critical for the carbon cycle." Conversely, Tzedakis et al., (2009) argue 'astronomical forcing alone cannot explain 10 the difference in interglacial intensity before and after the MBE." Our model simulations contradict the latter, that the MBE represents a change in the dynamics that drive glacial cycles.
As indicated in Figure 1, the single set of relations among orbital geometry and the climate system embodied in the CVAR model simulates the different characteristics of glacial cycles before and after the MBE. As such, the MBE is not a transition between regimes; rather there is something unique about the MBE in particular and MIS 11 in general.  Imbrie et al., (1993) describe 'the stage 11 paradox' as a significant mismatch between orbital position and changes in climate associated with MIS 11 in general and termination V (430 -415 kyr BP) in particular. The latter is defined 20 by the maximum in benthic *+ of MIS 12 and the benthic *+ plateau of MIS 11 (Broecker and van Donk, 1970).

Mechanisms for the Stage 11 Paradox
These periods are unique: Termination V is the longest of any during the previous half million years (Berger and Loutre, 1996;Droxler et al., 2003;Loutre and Berger, 2003;McManus et al., 2003;EPICA Community Members, 2004;Rohling et al., 2010;Liseicki and Raymo, 2005;Ruddiman, 2007). MIS 11 also is the longest period of prolonged, stable warm climate in the North Atlantic (Oppo et al., 1998;McManus et al., 1999;2003). Finally, many 25 areas have air and sea surface temperatures that reach values consistent with interglacial periods even though large areas of the Earth's surface are covered by ice (Ruddiman, 2007). Despite these large changes in climate, the changes in orbital geometry are small. Consistent with this seeming mismatch, the CVAR model does a poor job of simulating termination V in particular and MIS 11 in general. Figures 1-3 indicate that MIS 11 has more variables with persisting errors than any other 30 period, either in-or out-of-sample (as well as driving the higher RMSE out-of-sample). This indicates MIS 11 is a prolonged period during which the model is not able to use the four variables for orbital geometry to simulate climate, which is the definition of the 'stage 11 paradox.'

35
The CVAR's model's poor performance during MIS 11 could be caused by difficulties in orbital tuning. The insolation peak for MIS 11 occurs in the middle of the warm stage therefore, orbital tuning delays the interglacial peak in *+ contains fewer tie points that can be used to anchor the chronology (Desprat et al., 2005), which means that the orbitally tuned chronology of MIS 11 is less secure than other warm stages (Candy et al., 2014). As such, the model's failure during this period may simply represent the poor quality of the chronology to which the simulation is compared.
To evaluate whether the stage 11 paradox is an artifact of the poor quality of the chronology, we condition the model on some of the endogenous variables that are thought to play an important role in glacial cycles. Conditioning a model 5 on observed values for one or more endogenous variables always will improve performance (Oreskes et al., 1994), but the variable used to condition the model will have little effect on model performance if the model's poor performance during MIS 11 is caused by the poor quality of the chronology because no endogenous variable will have more/less information about the poor chronology. Contrary to this expectation, model performance during stage 11 depends on the variable used to condition the model. Conditioning the model on observed values of CO2 or Na allows 10 the model to simulate more of the decline in Ice (and more accurately simulate other variables) throughout MIS 11, including termination V (Figure 4). Conversely, conditioning the model on observed values for SST, which is thought to play an important role in MIS 11 (see below), does not improve the model's ability to simulate the interglacial in MIS 11. Although this failure may be explained by stronger latitudinal or meridional gradients in sea surface temperature (Kandiano et al., 2012), large variations in accuracy that depend on the endogenous variable used to 15 condition the model suggest that the model's failure during MIS 11 is not caused solely by weaknesses in orbital tuning.

20
The mechanisms and sequences that generate the 'stage 11 paradox' cannot be fully identified by the CVAR model because it greatly simplifies physical relations and it has a relatively coarse temporal resolution (1 kyr). Conversely, its ability to accurately simulate glacial cycles (except MIS 11) using orbital position alone allows the CVAR model to test competing hypotheses about the 'stage 11 paradox' by identifying exceptions to the model sequences that accurately simulate terminations other than termination V. In other words, the statistical ordering of simulation errors 25 allows us to identify what is unique about MIS 11 (and termination V) and whether these differences play an important role.
Explanations for terminations in general -and stage 11 in particular -share several components. Many start with a change in meridonal overturning circulation and a bipolar seesaw that create a negative correlation between changes in hemispheric temperatures. Specifically, terminations may start with changes in orbital position that add freshwater 30 to the North Atlantic, this freshwater melt slows Atlantic meridonal overturning circulation (Elliot et al., 2002;McManus et al., 2004;Oppo et al., 1995;Vidal et al., 1997), and this slowdown creates a nearly simultaneous change in sea surface temperatures in the Southern Hemisphere via the bipolar seesaw (Barker et al., 2009;Broecker 1998;1986;Schmittner et al., 2002;Stocker and Johnson, 2003). In addition to an opposite change in sea surface temperature, there is evidence that changes in buoyancy (Watson and Garabato, 2006), latitudinal shifts in the 35 Westerlies (Anderson et al., 2009;Ninnermann and Charles, 1997;Toggweiler et al., 2006), and/or a changes in sea ice (Stephens and Keeling, 2000) affect the flow of CO2 from the southern Ocean, which is an important reservoir for glacial/interglacial CO2 (Knox and McElroy, 1984;Sarmiento and Togeweiler, 1984;Seigenthaler and Wenk, 1984;Anderson et al., 2009;Skinner et al., 2013). These competing hypothesis for terminations in general and stage 11 in particular can be tested by the statistical ordering of the model errors. If changes in sea surface temperature initiate Termination V, the model's inability to 10 simulate termination V will 'start' with its inability to simulate SST. This inability will be indicated by simulation errors for SST that precede and have information about the simulation errors for other variables. Specifically, simulation errors for other variables, such as CO2, will not have prior information about the errors for SST and these errors will have prior information about the errors for the other variables, such as CO2.
The statistical ordering of simulation errors indicates that the simulation errors for SST do not precede the model's 15 inability to simulate MIS 11 and termination V. Errors for SST are preceded by the persisting errors for other variables (read across the SST row in Tables 3 and 4), such as CO2, and the persisting errors for SST do not have prior information about the simulation errors for any variables (read down the SST column) at . Using a threshold ≤ 0.10, there is some evidence that persisting errors for SST have information about Ice. Consistent with these results, conditioning the model on SST, which eliminates the simulation errors for SST, does not improve the model's ability 20 to simulate Ice during MIS 11 relative to other potential causes for the stage 11 paradox (Figure 4). In toto, these results suggest that model failures do not 'start with' an inability to simulate sea surface temperature; rather the failure to simulate sea surface temperature is caused by the inability to simulate some other variable(s). As such, changes in sea surface temperature probably are not ultimately responsible for the 'stage 11 paradox.' Instead, the statistical ordering generated by Equation (7) Figure 4). This supports the argument that high concentrations of CO2 are responsible for the warm interglacial during MIS 11 (Yin and Berger, 2012). Together, these results suggest that terminations in general, and termination V in particular, are driven by changes in atmospheric carbon dioxide. Furthermore, they are consistent with the notion that

13
But the model's inability to simulate MIS stage 11 may not start solely with an inability to simulate CO2. Persisting errors for Ice also are preceded by persisting errors for Ca and Fe (proxies for wind strength and aridity Section 2.1 and Supplemental Section I). And the persisting errors for Ca are preceded by the persisting errors for Na (a proxy for sea ice in the southern ocean Section 2.1 and Supplemental Section I) Although results cannot resolve the timing of the model's inability to simulate wind (Ca, Fe) and sea ice (Na), their importance suggests that the model's inability 5 to simulate the long interglacial of MIS 11 is generated in part by the model's inability to simulate the location and strength of winds, the extent of sea ice, and/or the ventilation of CO2 from the Southern Ocean.

10
Our model is able to accurately simulate entire glacial cycles for an out-of-sample period that does not prescribe GHG forcing: the simulation is driven only by changes in orbital geometry. This ability suggests that the model can accurately hindcast climate using known climate parameters, which is the criterion proposed by Tzedakis et al., (2009) for understanding the current climate and where it is headed. Although satisfying this criterion has to be interpreted with caution because predictability is not necessarily informative about the quality of a model with respect to capturing 15 underlying causality (see e.g. Oreskes et al., 1994, or Clements andHendry, 2005), the ability to hindcast climate suggests that our model could supplement the search for analogues for the Holocene (11,700 years before the present through the present), many of which focus on MIS 11 (Droxler et al., 2003;Tzedakis, 2010;Pol et al., 2011). Despite some similarities, our results suggest that such efforts are fraught with difficulty. Most importantly, the statistical model cannot use the four measures of orbital geometry to simulate the depth and length of the interglacial that is 20 associated with MIS 11. Conversely, the model is able to simulate many aspects of the current warm period (Figure 1 & 3): notable exceptions include peristing errors associated with Ice and SST (see below). This implies that any similarity in orbital geometry and feedback mechanisms (Imbrie et al., 1992;19932006) do not automatically translate into similar climates. As such, there probably are important differences between the Holocene and MIS 11.

25
Ironically, the interglacials during MIS 11 and the Holocene may share an important similarity: an important role for carbon dioxide. The inability to simulate the interglacial in MIS 11 is likely caused by a poorly-modelled physical mechanism that raises atmospheric carbon dioxide. It is highly unlikely that this mechanism is related to human activity, even though MIS 11 contains the first evidence for the use of fire by people in Britain (Gowlett, 2005;Preece et al., 2006). Conversely, others argue that Holocene warming is amplified by anthropogenic emissions of carbon 30 dioxide and methane 2005;2007).
Rather than trying to decide which aspects of the paleoclimate record 'line up' across marine isotope stages (e.g.  simulation error is an (innovational) outlier. The light gray area is the out-of-sample forecast period; MIS 11 is shaded dark gray. (b) same as above for carbon dioxide, (c) same as above for methane, (d) same as above for land ice, (e) same as above for Na, (f) same as above for SO4, (g) same as above for sea level, (h) same as above for SST 1 .  Ice volume ( 18 ) Thousand years before present   .7 ** Value rejects the null hypothesis at p < .05 ( * ), p < 0.01 respectively ( ** ). Blue indicates the out-of-sample simulation is more accurate than the in-sample simulation; red indicate the in-sample simulation is more accurate. Values in brackets indicate the number of outlier/persisting errors in the out-of-sample period relative to the number of outlier/nonzero mean errors in the in-sample period.. A large value implies that the in-sample simulation has significantly fewer outlier/persisting errors, which would make it more accurate than the out-of-sample simulation.
https://doi.org/10.5194/cp-2020-58 Preprint. Discussion started: 11 May 2020 c Author(s) 2020. CC BY 4.0 License. Table 3: p-values for test of significance on " (equation 8) for the sample that includes all endogenous variables. Red indicates rejection of the exclusion (p < 0.05) of lagged errors of other series, blue indicates rejection of the exclusion (p < 0.10) of lagged errors of other series, and green indicates rejection of the exclusion for the autoregressive lags ( p < 0.05). The red value of 0.023 in the second column of the first row indicates that the indicates that the simulation errors for CO2 have information about the simulation errors for temperature.