This work is distributed under the Creative Commons Attribution 4.0 License.
Can machine learning algorithms improve upon classical palaeoenvironmental reconstruction models?
Abstract. Classical palaeoenvironmental reconstruction models often incorporate biological ideas and commonly assume that the taxa comprising a fossil assemblage exhibit unimodal response functions of the environmental variable of interest. In contrast, machine learning approaches do not rely upon any biological assumptions, but instead need training with large datasets to extract some understanding of the relationships between biological assemblages and their environment. We have developed a two-layered machine learning reconstruction model, MEMLM (Multi Ensemble Machine Learning Model). The first layer applies three different ensemble machine learning models of random forests, extra random trees and LightGBM, trained on the modern taxon assemblage and associated environmental data to make reconstructions based on the three different models, while the second layer uses multiple linear regression to integrate these three reconstructions into a consensus reconstruction. We consider three versions of the model: 1) a standard version of MEMLM, which uses only taxon abundance data, 2) MEMLMe, which uses embedded assemblage information, using a natural language processing model (GLOVE) to detect associations between taxa across the training dataset, and 3) MEMLMc, which incorporates both taxon abundance and assemblage data. We train these MEMLM model variants with three high-quality diatom and pollen training sets and compare their reconstruction performance with three weighted averaging (WA) approaches of WACla (classical deshrinking), WAInv (inverse deshrinking) and WAPLS (partial least squares). In general, the MEMLM approaches, even when trained on only embedded assemblage data, perform substantially better than the WA approaches under cross-validation in the larger datasets. However, when applied to fossil data, MEMLM and WA approaches sometimes generate qualitatively different palaeoenvironmental reconstructions.
We applied a statistical significance test to all the reconstructions. This successfully identified each instance where the reconstruction is not robust with respect to the model choice. We find that machine learning approaches can outperform classical approaches, but can sometimes catastrophically fail, despite showing high performance under cross-validation, likely indicating problems when extrapolation occurs. We find that the classical approaches are generally more robust, although they can also generate reconstructions which have modest statistical significance, and therefore may be unreliable. We conclude that cross-validation is not a sufficient measure of transfer-function performance, and we recommend that the results of statistical significance tests are provided alongside the downcore reconstructions based on fossil assemblages.
Status: final response (author comments only)

RC1: 'Comment on cp-2023-69', Cajo ter Braak, 20 Oct 2023
General:
The purpose of the paper is unclear to me; it should perhaps be more focussed. One simple (?) purpose could be to demonstrate that particular machine learning methods can give more predictive power than classical approaches (analogue methods, WA, WAPLS, fxTWAPLS). Another one might be to show that combining/stacking predictors as in MEMLM improves prediction or robustness compared to using a single machine learning method (perhaps with the previous purpose included as a secondary purpose). The current version does not give info on the second option (what is the utility of stacking) and immediately compares (three!) different MEMLM versions with WA and WAPLS. So the possible advantage of combining the predictions of two (or three) rather similar machine learning methods remains unanswered. I would vote for adding this info to the MS.
The reduced RMSEP using MEMLM in the bigger data sets is impressive. That the different versions generate different reconstructions signals issues. There is no attempt to provide standard errors for the (WA or consensus) reconstructions.
The different versions of MEMLM (and of WA?) may generate qualitatively different reconstructions while having similar RMSEP on the training data (Fig. 4 gives a nice example with the same trend but a different level; why?). Could other statistics of performance that are in current usage for WAPLS, like average bias and maximum bias and, in Liu et al. 2020/2023, the regression slope, help to detect such issues? Did the different versions give different weights to their component predictors?
The paper uses five-fold cross-validation without further specification, so presumably using random folds. I note that cross-validation gives results, such as RMSEP, that depend on the random or chosen folds, and so gives variable results on reapplication. What is the error in RMSEP, i.e. how should the reader interpret a 9% improvement of one method compared to another? I note that the error is larger the smaller the number of folds. With a larger number of folds the error is smaller, but then geographic nearness of training samples becomes an issue (spatial autocorrelation giving pseudo-replication) and should be avoided, as is done for example in Liu et al. 2020/2023.
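The fold-dependence raised in this comment is easy to demonstrate: re-running five-fold cross-validation with different random fold assignments yields a spread of RMSEP values, against which any quoted between-method improvement should be judged. A minimal sketch on synthetic data, with an ordinary linear model standing in for the transfer function:

```python
# Sketch: cross-validated RMSEP varies with the random fold assignment,
# so small between-method differences may lie within fold-to-fold noise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))          # mock predictors
y = X[:, 0] + rng.normal(0, 1, 120)     # mock environmental variable

rmseps = []
for seed in range(20):                  # re-randomise the five folds 20 times
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
    rmseps.append(float(np.sqrt(np.mean((pred - y) ** 2))))

print(f"RMSEP mean {np.mean(rmseps):.3f}, fold-to-fold spread {np.ptp(rmseps):.3f}")
```

The spread shrinks as the number of folds grows, at which point spatially blocked folds (as in Liu et al. 2020/2023) become the relevant refinement.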
The application of GloVe (in either of the ways I think you did it, see detailed comment on L86) is similar to an old way of doing reconstruction. First apply a correspondence analysis to the training data (or a principal components analysis to the training data, or to the covariance matrix or inner-product matrix obtained from the training data, which is similar to the co-occurrence matrix used in GloVe), and then use the dimensions, i.e. the sample scores of the various axes (in this paper called (integrated) embedding vectors), in a second multiple linear regression step (similar to the random forest in the paper). See Roux 1979, cited in (ter Braak & van Dam 1989). If the full dimensionality is used (instead of only the first few dimensions), this type of approach should not be worse than one using the training data directly, as the embedding should contain a full representation of the training data. In summary: the GloVe approach does an unconstrained ordination of the training assemblage data (i.e. without environment data) and uses the dimensions (ordination axes / embedding vectors) so obtained for supervised learning of environment as a function of ordination axes. In my view, this approach is not likely to be superior to simpler direct approaches.
I note that the GloVe model is log-linear with free row and column parameters, so it is an RC-model in the sense of Goodman (Goodman 1981; Ihm & van Groenewoud 1984; Goodman 1986; Goodman 1991). Correspondence analysis is close to a log-linear model (Supporting Information to ter Braak & te Beest 2022), and both models are close to the Gaussian model (with equal tolerances) that was the basic motivation for WA and WAPLS.
GloVe uses the log(co-occurrences) for good reasons. Did you consider transformations of the assemblage taxon proportions in the other approaches? Could that not be handled naturally in your multiple ensemble approach?
Consider adding some motivation for why you do not consider modern-analogue methods or, alternatively, limit your conclusions.
The Discussion also needs attention. I found a number of remarks that appear to lack (precise) support. Some of these may only be answered using simulated data as in (ter Braak et al. 1993).
Details:
Abstract
You twice use constructs like “three models of A, B and C”. I vote for “three models (A, B and C)”.
L19 “MEMLMe, which uses embedded assemblage information” The intended audience is not likely to know what embedded assemblage information means. Even a Google search would not be of much help, I believe. Something with dimension reduction or unconstrained ordination of co-occurrence information might be more helpful.
L20 “MEMLMc which incorporates both taxon abundance and assemblage data.” What is the difference between taxon abundance data and assemblage data? Unclear.
L24 “embedded assemblage information” See L19.
L25 Why “However”?
L26 Make clearer that the different versions of MEMLM also generated qualitatively different reconstructions.
L26 Here you switch from present to past tense. Put all in past tense.
L28 “catastrophically fail” Where did you find this and why this emotional term?
Introduction
L50: should > could
L51 Delete “However,”
L51-54 Optimal methods and algorithms can be derived by making assumptions. The GloVe paper by Pennington et al. gives a nice example! But algorithms themselves never make assumptions (except for data properties, e.g. they may fail to work on negative data or symbolic data). A particular algorithm is only motivated from/derived from assumptions. For example, taking the mean does not assume a sample from a normal distribution (even for P/A data and Poissonian counts the mean is optimal!), but it is a fine summary of the location of data when such assumptions hold true. WA has been derived from the unimodal model for the ecological niches of taxa (with equal niche breadths), but it does not assume such models; it may well work well for other (strictly compositional) data, but one does not know whether it is best in some sense. Even modern analogue methods are ‘based on’ assumptions. The assumptions lead to the choice of a proper/the best measure of distance or similarity between the fossil sample and the training samples.
L56-7 These questions could already be addressed by any one of the three machine learning methods. It would be of interest what the benefit is of the super-learner/stack model approach in MEMLM compared to using a single ensemble method only. And could WA or WAPLS contribute to the multiple ensemble approach?
L57 “to apply in” > “and carried out”
L59: “The benefit of machine learning is that it has strong data mining and information extraction ability. An associated problem,” With “ability”, “data mining” and “information extraction” being vague terms, I find this a sentence without meaning. With “associated problem” in the next sentence, the writers appear to me to agree that the previous sentence is a problem.
L63 & L140: The reference to the fourth-corner paper (Legendre et al. 1997) appears ill-chosen to me. The paper is cited for “reduce [] overfitting errors” and “coexistence and environment”, while it is a paper that focusses on traits in behavioural ecology; it is not even on environment.
L64: “This is the motivation for the ensemble learning approach we present, namely the Multi Ensemble Machine Learning Model (MEMLM)” It is unclear to me from this sentence whether MEMLM is a new method developed and explained in this paper or an existing method. No reference is given, so it looks new/novel, although it does not sound new to me. Please cite earlier work in this direction. I know for example the “super learner” approach by van der Laan (https://doi.org/10.2202/1544-6115.1309) and Google gave me (Naimi & Balzer 2018).
L65: “Classical studies [] reconstruction approaches” cites Norberg et al. 2019, but while that paper may contain “integrat[ion]”, it is not on reconstruction of palaeoenvironment.
L66: “does not attribute weights to” > “gives equal weight to”
L67-72: There is an issue in this bit with what a ‘weight’ is. I note that a linear multiple regression model has regression coefficients that are often referred to as “weights”. A predictor (taxon) with a small regression coefficient has a small weight, and a predictor with a large (absolute) coefficient has a large weight, a large influence, whereas the predictors are initially unweighted (no user-defined weights). In this sense even WA and WAPLS weigh taxa differentially in their transfer function. In TWAPLS there is indeed an additional weight, called the tolerance.
L74: “In MEMLM, we apply both taxon weights and model weights.” So I hoped that under Methods these weights would be clearly introduced or explained, but I did not find such an explanation. Also, I would expect in Results the weight given to each model.
L74: Why do you use two, or even three, very similar methods in MEMLM? All are based on decision trees and are already ensemble methods in their own right.
L79: “includes” > uses. And add the type of encoding/mention GloVe.
L81: The abbreviation NLP is not/rarely used later on, so delete.
L84: The term “environmental assemblages” is unknown to me. Change.
L86: You write “GLOVE to generate the embedding vectors of different taxa in different samples based on assemblage information”, suggesting that GloVe is applied to the assemblage information, whereas on L84 you write “In environmental assemblages, there are analogous cooccurrence relationships between taxa which we hypothesize convey information on their ecological functioning”, which suggests to me that GloVe is applied to a (perhaps weighted) co-occurrence matrix.
The second approach gives vectors for taxa (their meaning in your language analogy), as mentioned explicitly on L86. But: you do not describe how those vectors were transferred to vectors for samples for use in random forest and the like (Layer one of Figure 1), except for the phrase “and then to integrate the embeddings within each sample to represent the assemblage”. You do not describe how you did this, except that you write on L159 “the assemblages can be described as linear combinations of the features”, but you do not describe how (I guess: as in principal components analysis, but more detail is needed, likely in an appendix). And however you did this, describe why it is a good way (or the best way) to do it. If you followed the first approach, you applied the log-bilinear model (or RC (Goodman’s row-column) model) to the primary training data and did not calculate a co-occurrence matrix.
Note that co-occurrence is usually calculated from 1/0 (binary) primary data, whereas you have compositional data. So this needs explanation/details as well.
L91: classical reconstruction approaches > WA and WAPLS as there are more approaches than WA around, notably modern analogue methods.
L130-131. Note that this uses the full data set. So the resulting RMSEP and R2 are not cross-validation RMSEP and R2 in a formal sense. This should be remarked in the Discussion.
L146-159 The Pennington et al. paper is a great paper in my view. But while everything you write here is in that paper to motivate the approach, it does not understandably summarize GloVe to me. GloVe is a row-column bilinear model of the form (r_i + c_j + R_i*C_j) fitted to the log-transformed co-occurrence matrix derived from the primary data. It is fitted by weighted least-squares to log(co-occurrence count + 1) [so as to avoid problems with log 0]. GloVe is thereby very close to unconstrained ordination models used in ecology, except for the transformation to co-occurrences; see the Supplementary Information in (ter Braak & te Beest 2022) and the discussion in (ter Braak 1988).
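For reference, the weighted least-squares objective of Pennington et al. (2014) that this comment paraphrases is, in their notation (taxon/word vectors $w_i$, context vectors $\tilde{w}_j$, biases $b_i, \tilde{b}_j$, co-occurrence counts $X_{ij}$):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2,
\qquad
f(x) =
\begin{cases}
(x/x_{\max})^{\alpha}, & x < x_{\max},\\
1, & \text{otherwise.}
\end{cases}
```

The sum runs over nonzero $X_{ij}$ only (and $f(0)=0$), so $\log 0$ never occurs in the original formulation; the bilinear term $w_i^{\top}\tilde{w}_j$ plays the role of $R_i C_j$ and the biases play the role of the free row and column parameters $r_i, c_j$ in the RC-model comparison above.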
L159-160: So in essence the same assemblage data is entered twice (original and transformed) as predictors in Layer one (Figure 1). Apparently, the authors have little confidence that ensembles of decision trees can figure out all interesting combinations of taxa… (and perhaps they are right, except that the results of MEMLM and MEMLMc look rather similar). Actually, any differences between the MEMLM versions would show that the basic machine learning methods used in layer one ‘fail’ in one way or another.
L147: I would call it a conditional probability. [but this text should be removed/replaced anyway, see before].
Section 2.2 Use the same order everywhere (in 2.2.1 and 2.2.2), from smallest to largest training data set: SWAP, NIMBIOS, SMPDSv1.
L198. Model parameters are not “Performance and validation metrics”.
L199 under > using
L199 frame > Python software library.
L199200 Delete “for … feedback”
L209 “explore to integrate” Rephrase.
L201 256 dimensions: for all data sets? With this large number of dimensions, the GloVe solution (if properly done) should almost be identical to the (possibly transformed) training assemblages.
L204 Which Python function/file in https://github.com/Schimasuperbra/MEMLM is the rewritten function, so that we can check it?
L213 “upstream model” First usage of this term. Meaning?
L218. See earlier comment on crossvalidation and its dependence on the folds or randomness.
L223 “tests” ?> guards against?
L233 scikit-learn package. What role has this package? Is it a Python package?
L240 “The additional learning power with increasing trainingset size is evident.” Say simpler.
L244 “MEMLMe consistently under performs relative to MEMLM and MEMLMc” Say simpler.
L249 Figure A2a
L248-249 Rephrase. What does this figure look like for the much smaller SWAP data set?
L250 Rephrase.
L253 “unsurprisingly” ? What about the danger of overfitting?
L254 Change to: as the number of training epochs is decreased from the 1000 we used to 40.
Table 1 Legend. Make sure we know for sure that the R2 is also cross-validated and does not only apply to the RMSEP (where the last P, to me, already suggests prediction under cross-validation).
L268 “lowest RMSEP” but two MEMLM models in the figs. I would vote for all three MEMLM models, WACla and WAPLS (at least in a supplement).
L269 reverse WAPLS and “the best classical approach”
L270 PLS component 2 > PLS using 2 components
L276 Rephrase avoiding “understates”. Say simpler.
L280 Refer to Figure 2 earlier in the text.
Figure 6a. Please add another figure (in Suppl?) with a scatterplot matrix of the 3 (or 4) reconstructions (against one another). Similarity and dissimilarity are hard to see in Fig.6a, but may be present.
L317 inconsistent. In which sense? Rephrase.
Results section:
No info is given on the weights of the machine learning approaches in the consensus, nor on their individual (truly) cross-validatory RMSEP and R2.
Discussion:
L325-327 As you had 3 machine learning methods in your MEMLM and three versions of MEMLM, it is easy for a reader to get confused. I certainly was on first reading. I missed for example on first reading that the lines in the figs are already consensus reconstructions. But not that all methods could be included…
L327 Is multiple regression a weak learner approach? In which sense? Simply: delete “a weak learner approach based on”.
L334-336 “We note that the real power of embedding (dimension reduction) approaches in ecology is likely to be in their applications to much larger datasets, when ecological relationships between 10,000’s of taxa and their environment are being considered.” Likely? 1) The sentence is unclear in the sense of whether it is the number of samples or the number of taxa that is “larger”. If the number of taxa is large or huge (compared to the sample size?), dimension reduction might help by reducing the number of predictors. Dimension reduction might help to reduce ‘noise’.
L339-342 Say simpler. I do not believe that we want “a more complete description of a dataset”. We want robust prediction. “suggesting”: why not “we show on three data sets that ….”
L357-358. A “However” linking two different methods? Intentional?
L359 “felt”? See at L334336.
L362 “Even though all models were applied under the same extrapolation, the WAPLS2 reconstructions were found to be more reliable than MEMLM, although WAPLS2 also failed to generate robust reconstructions at Villarquemado.” 1) Two although’s in the same sentence. The first should not be a “though”: OK, all models used the same data and thus had to face the same issue with potential extrapolation (say where, in particular). 2) What is the evidence in your results or in the literature for the remark “the WAPLS2 reconstructions were found to be more reliable than MEMLM”?
L356 “ensure the reconstructions”. Now it reads as if significance testing can help make the reconstructions more robust. It can only help avoid accepting/publishing non-robust reconstructions.
L375 “available on request”. This is very old-fashioned. Make them available with credits where credit is due.
L380 I would like to see more code for making the paper reproducible by somebody that is knowledgeable in [both R and] Python. Add the type of software : R, Python, script or function.
Figure A1 legend. “Regression visualization” > “Scatter plots”
Table A2 What do the values represent. How are they defined? What parameter in the output is it? In which sense are they useful to the reader?
Cajo ter Braak, Wageningen, Oct 20, 2023.
Goodman, L.A. (1981) Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76, 320-334.
Goodman, L.A. (1986) Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. International Statistical Review, 54, 243-270. https://doi.org/10.2307/1403053
Goodman, L.A. (1991) Measures, models and graphical displays in the analysis of cross-classified data. Journal of the American Statistical Association, 86, 1085-1138.
Ihm, P. & van Groenewoud, H. (1984) Correspondence analysis and Gaussian ordination. Compstat Lectures, 3, 5-60.
Naimi, A.I. & Balzer, L.B. (2018) Stacked generalization: an introduction to super learning. European Journal of Epidemiology, 33, 459-464. https://doi.org/10.1007/s10654-018-0390-z
ter Braak, C.J.F. (1988) Partial canonical correspondence analysis. In: Classification and related methods of data analysis (ed. H.H. Bock), pp. 551-558. Elsevier Science Publishers B.V. (North-Holland), Amsterdam. http://edepot.wur.nl/241165
ter Braak, C.J.F., Juggins, S., Birks, H.J.B. & van der Voet, H. (1993) Weighted averaging partial least squares regression (WA-PLS): definition and comparison with other methods for species-environment calibration. In: Multivariate Environmental Statistics (eds G.P. Patil & C.R. Rao), pp. 525-560. North-Holland, Amsterdam. https://edepot.wur.nl/249353
ter Braak, C.J.F. & te Beest, D.E. (2022) Testing environmental effects on taxonomic composition with canonical correspondence analysis: alternative permutation tests are not equal. Environmental and Ecological Statistics, 29, 849-868. https://doi.org/10.1007/s10651-022-00545-4
ter Braak, C.J.F. & van Dam, H. (1989) Inferring pH from diatoms: a comparison of old and new calibration methods. Hydrobiologia, 178, 209-223. https://doi.org/10.1007/BF00006028
Citation: https://doi.org/10.5194/cp-2023-69-RC1 | AC1: 'Reply on RC1', Phil Holden, 15 Apr 2024

RC2: 'Comment on cp-2023-69', Andrew Parnell, 22 Dec 2023
The paper by Sun et al is an enjoyable paper to read about the use of some newer machine learning (ML) methods applied to palaeoenvironmental reconstruction. The ML approaches are well described and the approach is mostly easy to follow. Overall I found it a bit disappointing that the authors did not take a probabilistic ML approach to the problem given that uncertainty quantification is such an important part of reconstruction. We are left with an approach that seems to simply provide a best estimate of climate change over time. A bootstrapping or Bayesian ML extension of this work would have been most welcome to better compare the approaches.
I was excited and then quite confused about how the GLOVE model was used in the process. I had never come across GLOVE before and am still unclear as to why it was chosen as an embedding method. My understanding is that these approaches are somewhat similar to PLS in that they can reduce the dimensions of the inputs in a clever way so as to capture the majority of the information. Though unlike PLS the dimension reduction seems not to be based on the response variable. I’m not really sure why it is required in this approach as none of the data sets would be considered particularly high dimensional, nor did computation seem like a barrier to performance. The GLOVE approach in particular seemed like an odd choice since it seems to throw away the data values in lieu of presence/absence (I might be reading this wrong). If so, perhaps an autoencoder or a modern dimension reduction approach such as UMAP might have been more appropriate? The argument seems to be that the GLOVE model will capture interactions between proxy values, but it is unclear to me why the treebased ML models aren’t doing this already. Perhaps the lack of benefit of using GLOVE, as measured by RMSE and R2, is because it’s not an entirely suitable approach. I certainly feel that a clearer explanation of why GLOVE is used would be helpful.
Finally, in the conclusions I think it’s really important to point out that these models have some fundamental flaws which make them not really suitable for widespread use just yet. The most obvious one to me is the lack of uncertainty quantification mentioned above, but another is the lack of a time series model being included. The autocorrelation in both the training sets and the fossil reconstruction period is usually considerable, and as this quite often changes between the calibration and fossil periods, is a really important aspect of the models.
Some less important points:
 Introduction: the term ‘space instead time’ seems grammatically incorrect and not widely used (as far as Google tells me).
 Section 2: The first paragraph repeats the last paragraph above
Section 2.5.1. It’s slightly confusing to state that the embedding dimension was set to 256 when later in the paper it’s shown that you only need ~30. Again, it’s not clear whether the GLOVE approach is being used here to capture interactions (beyond what trees will capture?) or to reduce the dimension of the problem, in which case 256 seems like overkill. A little bit more discussion would be helpful.
Section 2.4. It would be nice to have some kind of computational speed estimate for running these models beyond just stating the hardware.
 Figure 2 (and similar figures) it’s not clear to me why the histograms for WAPLS are so different from the others. As someone who doesn’t use frequentist techniques I’d like a little more explanation of what’s happening here.
Code availability. Please mention this in the introduction or abstract. The code is well commented, but it’s a shame there aren’t more data or instructions on how to reproduce the results.
 Table A1. As there’s a regression model here I was hoping to see something about the weights (and their uncertainties) on the different models. Do you really need all three or are some models strongly preferred over others for different data sets?
Citation: https://doi.org/10.5194/cp-2023-69-RC2 | AC2: 'Reply on RC2', Phil Holden, 15 Apr 2024