General comments:
This is a re-review of Eley et al., who present the OPTIMAL model for converting GDGT distributions to SST. In the first round of review I raised a number of issues, which came down to 1) the tone of the paper, 2) concerns about overfitting and/or mathematical clarity, and 3) a lack of applications. In reading the revised version I think that the tone is now much more appropriate and respectful (with a few small exceptions noted in the specific comments below).

I am still concerned about overfitting. The authors argue that their in-sample validation technique (iteratively leaving out a portion of the coretop data for validation) is proof enough that the method is not overfitted. It is not, because they are still validating on the training dataset (the coretop data). Out-of-sample validation is the gold standard. This is one reason why I suggested last time that they add a number of applications of OPTIMAL to downcore datasets and compare the results to previous calibrations. They now provide an example in Figure 14 (application to ODP 806 and ODP 850), but this figure needs a comparison with previous calibrations (i.e., BAYSPAR) so that readers can see the differences. It is also not an ideal choice because the alkenones at ODP 806 are saturated for much of the record. The authors need to add more examples beyond this one: I suggest one high-resolution late Quaternary record and one deeper-time record (Eocene, perhaps). There should be data out there that include all of the fractional GDGTs, or else they can email the authors and get them. Note that Figures 13 and 14 are not really helpful - it is very difficult for readers to take much away from these. Paleoceanographers need to see applications. This is standard for new calibration papers and not, as the authors claim, out of the scope of the study. Only through out-of-sample testing can we assess whether OPTIMAL is performing well.
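To make the distinction concrete for the authors: in-sample validation cannot detect model misspecification outside the calibration domain. A minimal numpy sketch (toy data, invented functional forms, nothing to do with real GDGTs) shows how a held-out split drawn from the same domain looks fine while a genuine out-of-sample test does not:

```python
import numpy as np

rng = np.random.default_rng(42)

# toy stand-in for a mildly nonlinear proxy-temperature relation,
# sampled only over a "modern" range (analogous to the coretop domain)
def truth(x):
    return 0.001 * x ** 2

x = rng.uniform(0, 10, 300)
y = truth(x) + rng.normal(0, 0.005, 300)

# "leave out a portion" validation: a split *within* the calibration domain
idx = rng.permutation(300)
tr, te = idx[:225], idx[225:]
slope, intercept = np.polyfit(x[tr], y[tr], 1)

def rmse(xs, ys):
    return float(np.sqrt(np.mean((slope * xs + intercept - ys) ** 2)))

cv_rmse = rmse(x[te], y[te])          # small: same domain as the training data

# genuine out-of-sample test: a domain outside the calibration range
x_oos = rng.uniform(20, 30, 100)
oos_rmse = rmse(x_oos, truth(x_oos))  # large: the misspecification that
                                      # in-sample validation cannot see
```

The held-out RMSE stays small no matter how many in-sample iterations are run; only the out-of-sample test exposes the problem, which is why downcore applications matter.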
Regarding mathematical clarity, the paper is still lacking on this front. I noticed there are more details in the SOM - this should all be in the main text, including Fig. S1, which is incredibly helpful. I suggest reorganizing the Methods section so that everything is much clearer and more straightforward: describe the math, define the terms. It is also now evident that the inversion of the GPR model is Bayesian. This should be made explicit in the main text, along with a description of the priors used for SST. It also seems that the model can in fact be extrapolated (?), but that the uncertainties will be high unless an informative prior is used. Could you use constraints from other proxies here? That might provide some hope for deep-time users.
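For reference, the kind of description I have in mind is short. A toy grid-based sketch of Bayesian inversion of a forward proxy model (the forward model, noise level, and prior values are all invented for illustration) shows exactly where the prior on SST enters and how an informative prior from another proxy would constrain the result:

```python
import numpy as np

# hypothetical forward model: proxy = f(SST), with assumed Gaussian noise
def forward(sst):
    return 0.015 * sst + 0.35

sigma = 0.02                          # assumed proxy noise (std)
grid = np.linspace(-5, 50, 1101)      # SST grid for the inversion
dx = grid[1] - grid[0]

def posterior(obs, prior):
    # Bayes: p(SST | obs) is proportional to p(obs | SST) * p(SST)
    like = np.exp(-0.5 * ((obs - forward(grid)) / sigma) ** 2)
    post = like * prior
    return post / (post.sum() * dx)   # normalize on the grid

flat = np.ones_like(grid)                            # uninformative prior
informative = np.exp(-0.5 * ((grid - 30) / 3) ** 2)  # e.g. from another proxy

obs = forward(35.0)                   # a "warm" observation
p_flat = posterior(obs, flat)
p_inf = posterior(obs, informative)   # narrower than the flat-prior posterior
```

Spelling the inversion out at this level (likelihood, prior, normalization) in the main text would resolve most of my confusion about what is being assumed.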
All in all, it will be much easier to understand the steps that the authors took if the math is laid out in detail, along with (as I asked for last time) a description of what platforms and codes were used (Matlab, Python packages).
Also, my questions about collinearity (the fact that the GDGT predictors are not independent but in fact correlated with each other to a high degree) and possible regression dilution in the GPR model (or some other problem - maybe with the prior) were not explicitly addressed.
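To be concrete about the collinearity issue: fractional abundances are compositional (they sum to one), so the full set of predictors is exactly collinear, and any subset is still strongly correlated. A toy numpy illustration, with Dirichlet draws standing in for hypothetical GDGT fractions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dirichlet draws stand in for fractional abundances: they sum to one
# (closure), so the predictors cannot be independent
frac = rng.dirichlet(alpha=[2, 5, 1, 1, 3], size=500)  # 5 toy "GDGT" fractions

corr = np.corrcoef(frac, rowvar=False)  # off-diagonals come out negative

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from
    regressing column j on the remaining columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1 - (y - Z @ beta).var() / y.var()
    return float(1.0 / max(1.0 - r2, 1e-300))  # guard exact collinearity

# with all fractions included the design matrix is rank-deficient: VIF blows up
full_set_vif = vif(frac, 0)
```

Reporting something like VIFs (or simply the predictor correlation matrix) would let readers judge how much unique information each GDGT actually contributes.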
I welcome the inclusion of a comparison of Dnearest to BIT and MI, but deltaRI should be here too. In practice this is one of the most valuable metrics for identifying aberrant GDGT distributions.
Finally, it seems like the authors misunderstood my comment last time about providing some first-order constraints on their model. I was advocating for enforcing monotonicity, i.e. more rings = higher temperature. I was not advocating for a particular form for that (linear vs. non-linear). A monotonic constraint still seems appropriate to me. Couldn't this be built into OPTIMAL? Why would one not want to do so? The authors mention culture studies, most of which find more rings at higher T's. Even Bale et al. 2019 find more cren and cren' at higher T's, as expected. It seems like we have enough evidence in favor of monotonicity.
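For what it's worth, monotonicity can be enforced without committing to any particular functional form, e.g. via isotonic regression (pool-adjacent-violators). A minimal numpy sketch, with invented temperature values assumed already sorted by ring index:

```python
import numpy as np

def pava(y):
    """Least-squares monotone (nondecreasing) fit via pool-adjacent-violators."""
    out = []  # list of [block_mean, block_size]
    for v in np.asarray(y, dtype=float):
        out.append([v, 1])
        # merge backwards while adjacent block means violate monotonicity
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, w2 = out.pop()
            m1, w1 = out.pop()
            out.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[m] * w for m, w in out])

# invented temperatures, pre-sorted by ring index; the single
# violation (8 -> 7) is pooled to 7.5
mono = pava([4.0, 8.0, 7.0, 12.0, 15.0])  # -> 4.0, 7.5, 7.5, 12.0, 15.0
```

This is only meant to show that a "more rings = higher temperature" constraint is cheap to impose; presumably an analogous constraint could be built into the GPR (e.g. via the prior) rather than applied post hoc.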
If the authors can revise their paper to make the math clear, discuss any limitations as appropriate, and demonstrate that OPTIMAL can perform reasonably well on out-of-sample data (downcore time series) that would make me (and I suppose all readers) much more comfortable in terms of using this new approach.
Specific comments:
Line 131: "Like Qin et al. (2015), we note the non-linear nature of the individual experiments in Wuchter et al. (2004; see Fig. 5)." Need to clarify here that you mean Wuchter et al.'s Figure 5 and not your own. However, this statement is misleading. What Figure 5 in that paper shows is no response of TEX to SST between 5 and 15 degrees, and then a linear response thereafter in the series I incubation, plus no response in series II. It is not a non-linear (i.e., exponential) relationship. Combined with Schouten '07, the mesocosm data are linear between 10C and 40C (Schouten et al., 2007, Figure 4). I agree that this does not preclude a non-linear relationship in the real world - but it's important not to misstate what the data show.
Line 150: "As such, these are collectively more representative of the community production contributing to samples in the global core-top TEX86 calibrations of Kim et al., (2010) and BAYSPAR (Tierney & Tingley, 2014), which predominantly sample continental margin environments, rather than deep ocean / pelagic environments." I don't agree with this. How could a single-strain culture be more representative than environmental samples, which likely reflect multiple strains? There is no way that it could be. We don't know which strains contribute to the coretop dataset, but it is certainly more diverse than just N. maritimus. Please remove.
Line 164: "To use the responses of single, selected archaeal strains in culture to validate a particular model of community-level responses to growth temperature is problematic even in the modern system (Elling et al., 2015)." I agree, and this directly contradicts the statement on Line 150.
Line 195: "Powerful mathematical tools." I asked you to please delete this last time, as it is hyperbolic and non-specific. The tools here are no more powerful than other mathematical approaches. Replace with "an analysis of distance metrics", "machine learning", "GPR", or something similar.
Line 252: "For example, it may be that sea surface temperatures are very sensitive to one observable." I think you mean the reverse here - the observables are the GDGTs, and they might be more or less sensitive to SST. This paragraph would be easier to understand if you just use the term "each GDGT" vs. "observable".
Eq. 7: The description of this distance metric is not totally clear. Can you clarify what x and y are here? Also as pointed out by Yi Ge, there are only five degrees of freedom, so doesn't this need to be adjusted accordingly?
Lines 289-300: To what extent is the gain in information from the individual GDGTs due simply to the use of more than one parameter? As I pointed out in my first review, adding more parameters will improve RMSE but doesn't necessarily mean an improvement in skill. This should be addressed here. You can't say that the NN predictor "outperforms" TEX unless you rule out the effect of using more parameters, which incidentally are also not independent of each other.
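The point is easy to demonstrate: in-sample error always drops when predictors are added, even pure-noise ones, while held-out error does not. A toy numpy illustration (synthetic data, nothing proxy-specific):

```python
import numpy as np

rng = np.random.default_rng(7)

def fit(X, y):
    # ordinary least squares with an intercept column
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def rmse(beta, X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((y - X1 @ beta) ** 2)))

n = 100
x_tr, x_te = rng.normal(size=n), rng.normal(size=n)
y_tr = 2 * x_tr + rng.normal(size=n)   # one real predictor, unit noise
y_te = 2 * x_te + rng.normal(size=n)

# the same real predictor plus 20 pure-noise columns
X1_tr, X1_te = x_tr[:, None], x_te[:, None]
X21_tr = np.column_stack([x_tr, rng.normal(size=(n, 20))])
X21_te = np.column_stack([x_te, rng.normal(size=(n, 20))])

b1, b21 = fit(X1_tr, y_tr), fit(X21_tr, y_tr)

# in-sample RMSE necessarily drops when columns are added,
# but held-out RMSE does not improve: more parameters != more skill
train_gain = rmse(b1, X1_tr, y_tr) - rmse(b21, X21_tr, y_tr)  # > 0 always
test_gain = rmse(b1, X1_te, y_te) - rmse(b21, X21_te, y_te)   # ~0 or negative
```

A comparison that holds the number of (effective) parameters fixed, or that is scored out-of-sample, is what's needed before claiming the NN predictor "outperforms" TEX.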
Line 345: This R^2 of 0.83 is identical to what BAYSPAR can do with all of the data (R^2 = 0.84, Figure 5 in TT14). So it seems like both the random forest and BAYSPAR model perform similarly, even though BAYSPAR uses TEX86. Worth noting.
Line 372: The authors haven't answered my question yet about the effects of collinearity (the correlation of the fractional GDGT abundances with each other). This seems like a good place to clarify this issue.
Line 425: The DCs look like they are showing evidence of the "horseshoe effect" common to standard PCA, in that branches B and C are part of the same horseshoe. This would signal a strong linear response in the multivariate data that isn't well-separated (?). When this occurs in standard PCA, the PCs are not interpretable. Can the authors comment here on how this applies to the diffusion map technique?
Line 431: What do you make of the directionality change in the DC - TEX relationship at the high end?
Line 477: "it does not account for the systematic uncertainty in the model when extrapolated beyond the calibration range" It does in that all possible regression lines are extrapolated, creating wide error bars and coalescing towards the prior if no other information is there. I would change to, "it still assumes a monotonic relationship between TEX and SST" which is more accurate. However I don't think this is a bad assumption (see comment above).
Lines 481-486: So if I understand this correctly, the inversion of your model is Bayesian inference with priors placed on...what? This section needs some mathematical description to make this clear.
Line 488: "The clear pattern in the residuals does not necessarily indicate model misspecification, since no explicit noise model is specified for temperatures". I noted this last time - this looks like regression dilution. But if the model was specified as TEX = f(SST) + error this shouldn't happen, unless...the prior is too tight? Please clarify what your priors are.
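To illustrate what I mean by regression dilution: when the noisy variable sits on the predictor side, the fitted slope is attenuated by the reliability ratio. A toy numpy example with made-up slope and noise values:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 5000
sst = rng.normal(20, 5, n)               # "true" SST
tex = 0.02 * sst + 0.3                   # noiseless proxy (invented slope)
tex_obs = tex + rng.normal(0, 0.05, n)   # observation noise on the proxy

# regressing SST on the *noisy* proxy attenuates the slope
slope = np.polyfit(tex_obs, sst, 1)[0]

# classical errors-in-variables result: fitted slope = true inverse slope
# (1 / 0.02 = 50) times the reliability ratio var(signal)/var(signal + noise)
reliability = np.var(tex) / (np.var(tex) + 0.05 ** 2)
expected = (1 / 0.02) * reliability      # well below 50
```

If the residual pattern in Figure(s) around Line 488 matches this kind of attenuation, then either the noise model or the prior is the likely culprit, which is why I keep asking for the priors to be written down.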
Line 521: This is the first mention of Dnearest, but it is not defined. Is Dnearest the same as D(x,y) described above? Please clarify.
Line 569: Bale et al. did observe an increase in cren and cren' in their culture experiments though, which suggests that there is some response of GDGTs to temperature, albeit not well-expressed in TEX86.
Line 592: Figures 13 and 14 don't really communicate how well the OPTIMAL model performs. It would be more useful to show a couple of time series and compare OPTIMAL vs. BAYSPAR.
Line 610: It's not necessary to italicize "parametric" here. I would rephrase this to more accurately describe the difference between BAYSPAR and OPTIMAL: "In contrast, while uncertainty bounds do increase when BAYSPAR is used to extrapolate beyond the modern calibration, they are not as large as OPTIMAL's because BAYSPAR still assumes a linear increase in SST at higher TEX values." As I said last time, BAYSPAR does account for model uncertainty; the issue is that the model we use is a linear one.
Line 615: deltaRI should be included here. It is arguably the most useful metric for ID'ing strange GDGT distributions.
Figure 16: I don't think this is an ideal application because UK37 is at its limit here at Site 806. Plus, OPTIMAL should be compared to the previous calibrations (BAYSPAR, and TEXH if you want, although the regression dilution in TEXH makes things hard to interpret).
SOM: This should be in the main text, especially Figure 1. Also, the SOM appears to contain leftover comments and incomplete references.
-Jess Tierney |