the Creative Commons Attribution 4.0 License.
Can machine learning algorithms improve upon classical palaeoenvironmental reconstruction models?
Peng Sun
Philip B. Holden
H. John B. Birks
Abstract. Classical palaeoenvironmental reconstruction models often incorporate biological ideas and commonly assume that the taxa comprising a fossil assemblage exhibit unimodal response functions of the environmental variable of interest. In contrast, machine learning approaches do not rely upon any biological assumptions, but instead need training with large data-sets to extract some understanding of the relationships between biological assemblages and their environment. We have developed a two-layered machine learning reconstruction model MEMLM (Multi Ensemble Machine Learning Model). The first layer applies three different ensemble machine learning models of random forests, extra random trees and lightGBM, trained on the modern taxon assemblage and associated environmental data to make reconstructions based on the three different models, while the second layer uses multiple linear regression to integrate these three reconstructions into a consensus reconstruction. We consider three versions of the model: 1) A standard version of MEMLM, which uses only taxon abundance data, 2) MEMLMe, which uses embedded assemblage information, using a natural language processing model (GLOVE) to detect associations between taxa across the training data-set and 3) MEMLMc which incorporates both taxon abundance and assemblage data. We train these MEMLM model variants with three high quality diatom and pollen training sets and compare their reconstruction performance with three weighted averaging (WA) approaches of WA-Cla (classical deshrinking), WA-Inv (inverse deshrinking) and WA-PLS (partial least squares). In general, the MEMLM approaches, even when trained on only embedded assemblage data, perform substantially better than the WA approaches under cross-validation in the larger data-sets. However, when applied to fossil data, MEMLM and WA approaches sometimes generate qualitatively different palaeoenvironmental reconstructions. 
We applied a statistical significance test to all the reconstructions. This successfully identified each incidence where the reconstruction is not robust with respect to the model choice. We find that machine learning approaches can outperform classical approaches, but can sometimes catastrophically fail, despite showing high performance under cross-validation, likely indicating problems when extrapolation occurs. We find that the classical approaches are generally more robust, although they can also generate reconstructions which have modest statistical significance, and therefore may be unreliable. We conclude that cross-validation is not a sufficient measure of transfer-function performance, and we recommend that the results of statistical significance tests are provided alongside the down-core reconstructions based on fossil assemblages.
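The two-layer design described in the abstract (three tree ensembles integrated by multiple linear regression) is a standard stacked-generalisation ("super learner") architecture. A minimal sketch with scikit-learn, using GradientBoostingRegressor as a stand-in for LightGBM and synthetic assemblage data (illustrative only; not the authors' pipeline or data):

```python
# Two-layer "consensus" reconstruction in the spirit of MEMLM.
# Layer 1: three tree ensembles; Layer 2: linear regression on their
# out-of-fold predictions (cv=5 inside StackingRegressor).
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(30), size=200)  # mock taxon-proportion assemblages
y = X @ rng.normal(size=30) + rng.normal(scale=0.1, size=200)  # mock environment

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
                ("xt", ExtraTreesRegressor(n_estimators=50, random_state=0)),
                ("gb", GradientBoostingRegressor(random_state=0))],
    final_estimator=LinearRegression(),
    cv=5)
stack.fit(X, y)
pred = stack.predict(X)
```

The Layer-2 coefficients (`stack.final_estimator_.coef_`) are the "model weights" of the consensus.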
Status: open (extended)
RC1: 'Comment on cp-2023-69', Cajo ter Braak, 20 Oct 2023
General:
The purpose of the paper is unclear to me; it should perhaps be more focussed. One simple (?) purpose could be to demonstrate that particular machine learning methods can give more predictive power than classical approaches (analogue methods, WA, WA-PLS, fxTWA-PLS). Another one might be to show that combining/stacking predictors as in MEMLM improves prediction or robustness compared to using a single machine learning method (perhaps with the previous purpose included as a secondary purpose). The current version does not give info on the second option (what is the utility of stacking?) and immediately compares (three!) different MEMLM versions with WA and WA-PLS. So the possible advantages of combining the predictions of two (or three) rather similar machine learning methods remain unanswered. I would vote for adding this info to the MS.
The reduced RMSEP using MEMLM in the bigger data sets is impressive. The different reconstructions that different versions generate signal issues. There is no attempt to provide standard errors for the (WA or consensus) reconstructions.
The different versions of MEMLM (and of WA?) may generate qualitatively different reconstructions while having similar RMSEP on the training data (Fig. 4 gives a nice example with the same trend but different levels; why?). Could other performance statistics in current usage for WA-PLS, like average bias and maximum bias and, in Liu et al 2020/2023, the regression slope, help to detect such issues? Did the different versions give different weights to their component predictors?
The paper uses five-fold cross-validation without further specification, so presumably using random folds. I note that cross-validation gives results, such as RMSEP, that depend on the random or chosen folds, and so gives variable results on re-application. What is the error in RMSEP, and how then should the reader interpret a 9% improvement of one method over another? I note that the error is larger the smaller the number of folds. With a larger number of folds, the error is smaller, but then geographic nearness of training samples becomes an issue (spatial auto-correlation giving pseudo-replication) and should be avoided, as is done for example in Liu et al 2020/2023.
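The fold-dependence of cross-validated RMSEP noted above is easy to demonstrate. A minimal sketch on synthetic data (names and numbers illustrative only): re-randomising the five folds changes the RMSEP estimate, so a single cross-validation run gives RMSEP with an error bar, not an exact value.

```python
# RMSEP under 5-fold CV varies with the random fold assignment.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] + rng.normal(scale=1.0, size=100)

rmseps = []
for seed in range(10):                      # re-randomise the 5 folds
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    rmseps.append(float(np.sqrt(mse)))
print(f"RMSEP spread over fold choices: {min(rmseps):.3f}-{max(rmseps):.3f}")
```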
The application of GloVe (in either of the ways I think you did it; see detailed comment on L86) is similar to an old way of doing reconstruction. First apply a correspondence analysis to the training data (or a principal components analysis to the training data, or to the covariance or inner-product matrix obtained from the training data, which is similar to the co-occurrence matrix used in GloVe) and then use the dimensions, i.e. the sample scores of the various axes (in this paper called (integrated) embedding vectors), in a second step: a multiple linear regression (similar to the random forest in the paper). See Roux 1979, cited in ter Braak & van Dam (1989). If the full dimensionality is used (instead of only the first few dimensions), this type of approach should not be worse than one using the training data directly, as the embedding is/should be a full representation of the training data. In summary: the GloVe approach does an unconstrained ordination of the training assemblage data (i.e. without environment data) and uses the dimensions (ordination axes / embedding vectors) so obtained for supervised learning of environment as a function of ordination axes. In my view, this approach is not likely to be superior to simpler direct approaches.
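The claim that regression on a full-dimensional ordination should be no worse than regression on the raw training data can be checked directly. A sketch using PCA via SVD on mock compositional data (illustrative only; not the paper's GloVe pipeline): with all axes retained, the fitted values are identical, because both design matrices span the same subspace.

```python
# 'Old way': unconstrained ordination (PCA) of the assemblages, then
# multiple linear regression of the environment on the sample scores.
import numpy as np

rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(12), size=60)     # mock assemblage proportions
y = X @ rng.normal(size=12) + rng.normal(scale=0.05, size=60)

def fit_predict(Z, y):
    Z1 = np.column_stack([np.ones(len(Z)), Z])     # add intercept
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)  # minimum-norm LS
    return Z1 @ beta

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                                     # full PCA sample scores

direct = fit_predict(X, y)                         # regression on raw data
via_ordination = fit_predict(scores, y)            # regression on all axes
```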
I note that the GloVe model is log-linear with free row and column parameters, so it is an RC-model in the sense of Goodman (Goodman 1981; Ihm & van Groenewoud 1984; Goodman 1986; Goodman 1991). Correspondence analysis is close to a log-linear model (Supporting Information to ter Braak & te Beest 2022) and both models are close to the Gaussian model (with equal tolerances) that was the basic motivation for WA and WA-PLS.
GloVe uses the log(co-occurrences) for good reasons. Did you consider in the other approaches transformations of the assemblage taxon proportions? Could that not be handled naturally in your multiple ensemble approach?
Consider adding some motivation why you do not consider modern-analogue methods or, alternatively, limit your conclusions.
The Discussion also needs attention. I found a number of remarks that appear to lack (precise) support. Some of these may only be answered using simulated data as in (ter Braak et al. 1993).
Details:
Abstract
You twice use constructs like “three models of A, B and C”. I vote for “three models (A, B and C)”.
L19 “MEMLMe, which uses embedded assemblage information” The intended audience is not likely to know what embedded assemblage information means. Even a google search would not be of much help, I believe. Something with dimension reduction or unconstrained ordination of co-occurrence information might be more helpful.
L20 “MEMLMc which incorporates both taxon abundance and assemblage data.” What is the difference between taxon abundance data and assemblage data? Unclear.
L24 “embedded assemblage information” See L19.
L25 Why “However”?
L26 Make more clear that the different version of MEMLM also generated qualitatively different reconstructions.
L26 Here you switch from present to past tense. Put all in past tense.
L28 “catastrophically fail” Where did you find this and why this emotional term?
Introduction
L50: should -> could
L51 Delete “However,”
L51-54 Optimal methods and algorithms can be derived by making assumptions. The GloVe paper by Pennington et al gives a nice example! But algorithms themselves never make assumptions (except for data properties, e.g. they may fail to work on negative data or symbolic data). A particular algorithm is only motivated by / derived from assumptions. For example, taking the mean does not assume a sample from a normal distribution (even for P/A data and Poissonian counts the mean is optimal!), but it is a fine summary of the location of data when such assumptions hold true. WA has been derived from the unimodal model for the ecological niches of taxa (with equal niche breadths), but it does not assume such models; it may well work well for other (strictly compositional) data, but one does not know whether it is best in some sense. Even modern analogue methods are ‘based on’ assumptions. The assumptions lead to the choice of a proper/the best measure of distance or similarity between the fossil sample and the training samples.
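To make concrete what "derived from the unimodal model but not assuming it" means for WA, here is a minimal weighted-averaging transfer function with inverse deshrinking, run on synthetic unimodal response data (illustrative only; not the paper's implementation):

```python
# Weighted averaging (WA): taxon optima = abundance-weighted means of the
# environment; inferred value for a sample = abundance-weighted mean of the
# optima, followed by inverse deshrinking.
import numpy as np

def wa_fit(Y, env):
    """Y: samples x taxa abundances; env: training environment values."""
    optima = (Y.T @ env) / Y.sum(axis=0)   # weighted mean per taxon
    raw = (Y @ optima) / Y.sum(axis=1)     # initial (shrunken) inferences
    b1, b0 = np.polyfit(raw, env, 1)       # inverse deshrinking regression
    return optima, (b0, b1)

def wa_predict(Y, optima, deshrink):
    raw = (Y @ optima) / Y.sum(axis=1)
    b0, b1 = deshrink
    return b0 + b1 * raw

rng = np.random.default_rng(3)
env = rng.uniform(0, 10, size=80)
opt_true = rng.uniform(0, 10, size=15)
# Gaussian (unimodal, equal-breadth) taxon responses along the gradient
Y = np.exp(-(env[:, None] - opt_true[None, :]) ** 2 / 4)
optima, deshrink = wa_fit(Y, env)
recon = wa_predict(Y, optima, deshrink)
```

The algorithm runs on any non-negative abundance matrix; the unimodal model only motivates why it works well.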
L56-7 These questions could already be addressed by any one of the three machine learning methods. It would be of interest what benefit there is of the superlearner/stack model approach in MEMLM compared to using a single ensemble method only. And, could WA or WA-PLS contribute to the multiple ensemble approach?
L57 “to apply in” -> “and carried out”
L59: “The benefit of machine learning is that it has strong data mining and information extraction ability. An associated problem,” With ability, data mining and information extraction being vague terms, I find this a sentence without meaning. With “associated problem” in the next sentence, the writers appear to me to agree that the previous sentence is a problem.
L63 & L140: The reference to the fourth-corner paper (Legendre et al. 1997) appears ill-chosen to me. The paper is cited for “reduce [] over-fitting errors” and “co-existence and environment”, while it is a paper that focusses on traits in behavioural ecology; it is not even on environment.
L64: “This is the motivation for the ensemble learning approach we present, namely
the Multi Ensemble Machine Learning Model (MEMLM)” It is unclear to me from this sentence whether MEMLM is a new method developed and explained in this paper or an existing method. No reference is given, so it looks new/novel, although it does not sound new to me. Please cite earlier work in this direction. I know for example the “super learner” approach by van der Laan (https://doi.org/10.2202/1544-6115.1309) and google gave me (Naimi & Balzer 2018).
L65: “Classical studies [] reconstruction approaches” cites Norberg et al. 2019, but that paper, while it may contain “integret[ion]”, is not on reconstruction of palaeoenvironment.
L66: “does not attribute weights to” -> “gives equal weight to”
L67-72: There is an issue with this bit on what a ‘weight’ is. I note that a linear multiple regression model has regression coefficients that are often referred to as “weights”. A predictor (taxon) with a small regression coefficient has small weight, and a predictor with a large (absolute) coefficient has a large weight (a large influence), whereas the predictors are initially unweighted (no user-defined weights). In this sense even WA and WA-PLS weigh taxa differentially in their transfer function. In TWA-PLS there is indeed an additional weight, called the tolerance.
L74: “In MEMLM, we apply both taxon weights and model weights.” So I hoped that under Methods these weights would be clearly introduced or explained, but I did not find such explanation. Also, I would expect in Results the weight given to each model.
L74: Why do you use in MEMLM two, or even three, very similar methods? All are based on decision trees and are already ensemble methods on their own.
L79: “includes” -> uses. And add the type of encoding/mention GloVe.
L81: The abbreviation NLP is not/rarely used later on, so delete.
L84: The term “environmental assemblages” is unknown to me. Change.
L86: You write “GLOVE to generate the embedding vectors of different taxa in different samples based on assemblage information”, suggesting that GloVe is applied to the assemblage information, whereas on L84 you write “In environmental assemblages, there are analogous co-occurrence relationships between taxa which we hypothesize convey information on their ecological functioning”, which suggests to me that GloVe is applied to a (perhaps weighted) co-occurrence matrix.
The second approach gives vectors for taxa (their meaning in your language analogy), as mentioned explicitly on L86. But: you do not describe how those vectors were transferred to vectors for samples for use in Random Forest and the like (Layer one of Figure 1), except for the phrase “and then to integrate the embeddings within each sample to represent the assemblage”. You do not describe how you did this, except that you write on L159 “the assemblages can be described as linear combinations of the features”, but not how (I guess: as in principal components analysis, but more detail is needed, likely in an appendix). And however you did this, describe why it is a good way (or the best way) to do it. If you followed the first approach, you applied the log-bilinear model (or RC (Goodman’s row-column) model) to the primary training data and did not calculate a co-occurrence matrix.
Note that co-occurrence is usually calculated from 1/0 (binary) primary data, whereas you have compositional data. So this needs explanation/details as well.
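One plausible, but entirely hypothetical, reading of "integrate the embeddings within each sample" (the MS does not specify the operation) is an abundance-weighted linear combination of the taxon vectors. In code, with all names and data illustrative:

```python
# Abundance-weighted aggregation of taxon embedding vectors into one
# vector per sample: sample_vectors[s] = sum_t P[s, t] * E[t].
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_taxa, dim = 50, 20, 8
P = rng.dirichlet(np.ones(n_taxa), size=n_samples)  # taxon proportions
E = rng.normal(size=(n_taxa, dim))                  # taxon embedding vectors

sample_vectors = P @ E   # (50, 8): proportion-weighted mean embedding
```

Whether this, or some other linear combination, is what the authors did is exactly what the MS should state.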
L91: classical reconstruction approaches -> WA and WA-PLS as there are more approaches than WA around, notably modern analogue methods.
L130-131. Note that this uses the full data set, so the resulting RMSEP and R2 are not cross-validation RMSEP and R2 in a formal sense. This should be remarked on in the Discussion.
L146-159 The Pennington et al paper is a great paper in my view. But while all you write here is in that paper to motivate the approach, it does not understandably summarize GloVe to me. GloVe is a row-column bilinear model of the form (r_i + c_j + R_i·C_j) fitted to the log-transformed co-occurrence matrix derived from the primary data. It is fitted by weighted least-squares to log(co-occurrence count + 1) [so as to avoid problems with log 0]. GloVe is thereby very close to unconstrained ordination models used in ecology, except for the transformation to co-occurrences; see the Supplementary Information in ter Braak & te Beest (2022) and the discussion in ter Braak (1988).
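For concreteness, the row-column bilinear model r_i + c_j + R_i·C_j fitted to log(count + 1) can be sketched with plain gradient descent. GloVe itself uses a weighted least-squares loss and a large text corpus; this unweighted toy version only shows the model's form:

```python
# Fit r_i + c_j + u_i . v_j to log(co-occurrence count + 1) by gradient
# descent on an unweighted squared-error loss (toy illustration of GloVe's
# RC / log-bilinear model, not GloVe's actual weighted objective).
import numpy as np

rng = np.random.default_rng(5)
counts = rng.poisson(3.0, size=(25, 25))  # mock taxon co-occurrence counts
L = np.log(counts + 1.0)                  # target; +1 avoids log 0

n, k = L.shape[0], 4                      # k = embedding dimension
r = np.zeros(n); c = np.zeros(n)          # free row and column parameters
U = 0.01 * rng.normal(size=(n, k))
V = 0.01 * rng.normal(size=(n, k))

def loss():
    return float(np.mean((r[:, None] + c[None, :] + U @ V.T - L) ** 2))

start = loss()
lr = 0.05
for _ in range(500):
    resid = r[:, None] + c[None, :] + U @ V.T - L   # prediction error
    r -= lr * 2 * resid.mean(axis=1)
    c -= lr * 2 * resid.mean(axis=0)
    U -= lr * 2 * resid @ V / n
    V -= lr * 2 * resid.T @ U / n
end = loss()
```

With the full rank k = n, the fit reproduces the data matrix, which is the full-representation point made under General above.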
L159-160: So in essence the same assemblage data are entered twice (original and transformed) as predictors in Layer one (Figure 1). Apparently, the authors have little confidence that ensembles of decision trees can figure out all interesting combinations of taxa… (and perhaps they are right, except that the results of MEMLM and MEMLMc look rather similar). Actually, any differences between the MEMLM versions would show that the basic machine learning methods used in Layer one ‘fail’ in one way or another.
L147: I would call it a conditional probability. [but this text should be removed/replaced anyway, see before].
Section 2.2 Use the same order everywhere (in 2.2.1 and 2.2.2), from smallest to largest training data set: SWAP, NIMBIOS, SMPDSv1.
L198. Model parameters are not “Performance and validation metrics”.
L199 under -> using
L199 frame -> Python software library.
L199-200 Delete “for … feedback”
L209 “explore to integrate” Rephrase.
L201 256 dimensions: for all data sets? With this large number of dimensions, the GloVe solution (if properly done) should almost be identical to the (possibly transformed) training assemblages.
L204 Which Python function/file in https://github.com/Schimasuperbra/MEMLM is the rewritten function, so that we can check it?
L213 “upstream model” First usage of this term. Meaning?
L218. See earlier comment on cross-validation and its dependence on the folds or randomness.
L223 “tests” ?-> guards against?
L233 scikit-learn package. What role has this package? Is it a Python package?
L240 “The additional learning power with increasing training-set size is evident.” Say simpler.
L244 “MEMLMe consistently under- performs relative to MEMLM and MEMLMc” Say simpler.
L249 Figure A2a
L248-249 Rephrase. What does this figure look like for the much smaller SWAP data set?
L250 Rephrase.
L253 “unsurprisingly” ? What about the danger of overfitting?
L254 Change to: as the number of training epochs is decreased from the 1000 we used to 40.
Table 1 Legend. Make sure we know for sure the R2 is also cross-validated and does not only apply to the RMSEP (where the last P is, to me, already suggesting prediction under cross-validation).
L268 “lowest RMSEP” but two MEMLM models in the figs. I would vote for all three MEMLM models, WA-Cla and WA-PLS (at least in a supplement).
L269 reverse WA-PLS and “the best classical approach”
L270 PLS component 2 -> PLS using 2 components
L276 Rephrase avoiding “understates”. Say simpler.
L280 Refer to Figure 2 earlier in the text.
Figure 6a. Please add another figure (in Suppl?) with a scatterplot matrix of the 3 (or 4) reconstructions (against one another). Similarity and dissimilarity are hard to see in Fig.6a, but may be present.
L317 inconsistent. In which sense? Rephrase.
Results section:
No info is given on the weights of the machine learning approaches in the consensus, nor on their individual (truly) cross-validatory RMSEP and R2.
Discussion:
L325-327 As you had 3 machine learning methods in your MEMLM and three versions of MEMLM, it is easy for a reader to get confused. I certainly was on first reading. I missed, for example, on first reading that the lines in the figs are already consensus reconstructions. But not that all methods could be included…
L327 Is multiple regression a weak learner approach? In which sense? Simply: delete “a weak learner approach based on”.
L334-336 “We note that the real power of embedding (dimension reduction) approaches in ecology is likely to be in their applications to much larger data-sets, when ecological relationships between 10,000’s of taxa and their environment are being considered.” Likely? 1) the sentence is unclear in the sense whether it is the number of samples or the number of taxa that is “larger”. If the number of taxa is large or huge (compared to the sample size?), dimension reduction might help by reducing the number of predictors. Dimension reduction might help to reduce ‘noise’.
L349-342 Say simpler. I do not believe that we want “a more complete description of a data-set”; we want robust prediction. And why “suggesting”, and not “we show on three data sets that …”?
L357-358. A “However” linking two different methods? Intentional?
L359 “felt”? See at L334-336.
L362 “Even though all models were applied under the same extrapolation, the WA-PLS2 reconstructions were found to be more reliable than MEMLM, although
WA-PLS2 also failed to generate robust reconstructions at Villarquemado.” 1) Two although’s in the same sentence; the first should not be a “though”: OK, all models used the same data and thus had to face the same issue with potential extrapolation (say where, in particular). 2) What is the evidence in your results or in the literature for the remark “the WA-PLS2 reconstructions were found to be more reliable than MEMLM”?
L356 “ensure the reconstructions”. Now it reads as if significance testing can help make the reconstructions more robust. It can only help avoid accepting/publishing non-robust reconstructions.
L375 “available on request”. This is very old-fashioned. Make them available with credits where credit is due.
L380 I would like to see more code for making the paper reproducible by somebody who is knowledgeable in [both R and] Python. Add the type of software: R, Python, script or function.
Figure A1 legend. “Regression visualization” -> “Scatter plots”
Table A2 What do the values represent? How are they defined? Which parameter in the output is it? In what sense are they useful to the reader?
Cajo ter Braak Wageningen Oct 20. 2023.
Goodman, L.A. (1981) Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76, 320-334.
Goodman, L.A. (1986) Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. International Statistical Review, 54, 243-270. https://doi.org/10.2307/1403053
Goodman, L.A. (1991) Measures, models and graphical displays in the analysis of cross-classified data. Journal of the American Statistical Association, 86, 1085-1138.
Ihm, P. & van Groenewoud, H. (1984) Correspondence analysis and Gaussian ordination. Compstat Lectures, 3, 5-60.
Naimi, A.I. & Balzer, L.B. (2018) Stacked generalization: an introduction to super learning. European Journal of Epidemiology, 33, 459-464. https://doi.org/10.1007/s10654-018-0390-z
ter Braak, C.J.F. (1988) Partial canonical correspondence analysis. Classification and related methods of data analysis (ed. H.H. Bock), pp. 551-558. Elsevier Science Publishers B.V. (North-Holland), Amsterdam. http://edepot.wur.nl/241165
ter Braak, C.J.F., Juggins, S., Birks, H.J.B. & van der Voet, H. (1993) Weighted averaging partial least squares regression (WA-PLS): definition and comparison with other methods for species-environment calibration. Multivariate Environmental Statistics (eds G.P. Patil & C.R. Rao), pp. 525-560. North-Holland, Amsterdam. https://edepot.wur.nl/249353
ter Braak, C.J.F. & te Beest, D.E. (2022) Testing environmental effects on taxonomic composition with canonical correspondence analysis: alternative permutation tests are not equal. Environmental and Ecological Statistics, 29, 849-868. https://doi.org/10.1007/s10651-022-00545-4
ter Braak, C.J.F. & van Dam, H. (1989) Inferring pH from diatoms: a comparison of old and new calibration methods. Hydrobiologia, 178, 209-223. https://doi.org/10.1007/BF00006028
Citation: https://doi.org/10.5194/cp-2023-69-RC1