Abstract :
Introduction The pre-processing of analytical data in metabolomics must be considered as a whole to allow the construction
of a global and unique object for any further simultaneous data analysis or multivariate statistical modelling. For 1D
1H-NMR metabolomics experiments, best practices for data pre-processing are well defined, but this is not yet the case for
2D experiments (COSY is the example used in this paper).
Objective By considering the added value of a second dimension, the objective is to propose two workflows dedicated to
2D NMR data handling and preparation (the Global Peak List and Vectorization approaches) and to compare them (with
respect to each other and with 1D standards). This makes it possible to determine which methodology yields the most
metabolomic content and to explore the advantages of the selected workflow in distinguishing among treatment groups and
identifying relevant biomarkers. This paper therefore examines the need for novel 2D pre-processing workflows,
the evaluation of their quality, and their performance in the subsequent determination of accurate (2D)
biomarkers.
Methods To select the most informative data source, MIC (Metabolomic Informative Content) indexes are used, based on
clustering and inertia measures of quality. Then, to highlight biomarkers or critical spectral zones, a PLS-DA model is
used, along with more advanced sparse algorithms (sPLS and L-sOPLS).
Results Results are discussed for two different experimental designs (one unsupervised and based on
human urine samples, the other controlled and based on spiked serum media). MIC indexes are reported and guide
the choice of the most relevant workflow for the subsequent analysis. Finally, biomarkers are provided for each case and the
predictive power of each candidate model is assessed with cross-validated RMSEP (root mean square error of prediction).
Conclusion It is shown that no single solution is universally the best in every case, but that 2D experiments
make it possible to clearly identify relevant cross-peak biomarkers even when the initial separability between groups is
poor. The MIC measures computed for the candidate workflows (2D GPL, 2D vectorization, and 1D, each with specific
parameters) make it possible to visualize which data set should be used first to find biomarkers more easily. The diversity
of data sources, mainly 1D versus 2D, often leads to complementary or confirmatory results.
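
As an illustration of the two kinds of quantities mentioned above, the following minimal Python sketch computes a generic inertia-based separability ratio (between-group inertia divided by total inertia) and an RMSEP. It is an assumption-laden example and not the paper's MIC definition: the function names `between_inertia_ratio` and `rmsep` are hypothetical, and the exact MIC indexes may be defined differently.

```python
from collections import defaultdict


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def between_inertia_ratio(X, labels):
    """Share of total inertia explained by group membership (0 to 1).

    Assumption: a generic inertia decomposition, used here only to
    illustrate an inertia-based quality measure.
    """
    g = centroid(X)
    total = sum(sq_dist(row, g) for row in X)
    if total == 0:
        raise ValueError("all samples are identical; ratio is undefined")
    groups = defaultdict(list)
    for row, lab in zip(X, labels):
        groups[lab].append(row)
    # Each group contributes its size times the squared distance
    # between its centroid and the global centroid.
    between = sum(
        len(rows) * sq_dist(centroid(rows), g) for rows in groups.values()
    )
    return between / total


def rmsep(y_true, y_pred):
    """Root mean square error of prediction.

    In a cross-validated setting, y_pred would hold the predictions
    obtained for samples left out of model fitting.
    """
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


# Toy data: two well-separated groups along the first variable.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = ["a", "a", "b", "b"]
ratio = between_inertia_ratio(X, labels)
error = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A ratio close to 1 indicates that most of the variability lies between groups rather than within them, and a lower RMSEP indicates better predictive power.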