Abstract :
[en] In the quest to make multidimensional chromatography a robust method for untargeted screening of small molecules, one of the key remaining challenges is reproducibility. To reach this objective, some analytical aspects need to be investigated, such as column dimensions and separation conditions. The biggest challenge, however, is the data processing step that goes from raw data to information extraction. To enable the evaluation of data analysis methods and processing workflows, a reference data set is required. In this study, we used whole stool research grade test materials (RGTMs) from NIST to develop a control data set that supports the comparison of sampling, analysis, and processing workflows. These RGTMs were designed for an inter-method and interlaboratory study on whole stool samples, with the aim of developing a standard reference material. The RGTMs cover two diets, vegan and omnivore, and two sample preparations, liquid and lyophilized. In this presentation, we will focus on the use of this data set to evaluate data processing approaches.
The robustness of several workflows involving commercial, in-house, and open-source solutions was investigated. First, we investigated the impact of the user on a well-established ANOVA-based workflow. The goal was to evaluate the weight of human decisions on the final classification metrics and on the significant features identified. Next, we developed and evaluated a new processing approach combining tile-based image comparison and machine learning-based feature selection.
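As an illustration of what ANOVA-based feature selection involves, the following minimal sketch computes a one-way ANOVA F-ratio per feature and ranks features by it. It is not the workflow evaluated in this study: the function names `f_ratio` and `rank_features`, the feature names, and the peak-area values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F statistic for a single feature.

    `groups` is a list of lists, one list of values per class.
    """
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        # No within-class variance: F is undefined, treat as extreme.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(table, labels):
    """Rank feature names by decreasing F-ratio.

    `table` maps a feature name to its values, in the same order as `labels`.
    """
    classes = sorted(set(labels))
    scores = {}
    for name, values in table.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c]
                  for c in classes]
        scores[name] = f_ratio(groups)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data, NOT measurements from the RGTM data set.
    labels = ["vegan"] * 3 + ["omnivore"] * 3
    table = {
        "feature_A": [1, 2, 3, 4, 5, 6],   # differs between classes
        "feature_B": [1, 5, 3, 2, 4, 3],   # same class means
    }
    print(rank_features(table, labels))
```

In a workflow of this kind, the user typically decides where to place the F-ratio or p-value threshold, which is one of the human decisions whose weight a user impact study can assess.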
For the user impact study, our well-established workflow proved to be robust to human decisions. Indeed, the choices made by the user during data cleaning, pre-processing, and model building did not affect the overall outcome of the study. For our new processing approach, the combination of tile-based alignment and random forest classification increased the robustness compared to the ANOVA-based approach. Indeed, the false positive rate during feature selection decreased, and we were able to process unbalanced data sets.
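The general idea behind tile-based comparison can be sketched as follows: the two-dimensional chromatogram is treated as an image and summed over rectangular tiles, so that a peak shifted by a small retention-time offset still contributes to the same tile. This is a simplified sketch under stated assumptions, not the approach developed in this study: the function name `tile_sums`, the tile size, and the toy images are hypothetical.

```python
def tile_sums(image, tile_rows, tile_cols):
    """Sum a 2D intensity image over non-overlapping rectangular tiles.

    `image` is a list of equal-length rows. Tiles at the right and bottom
    edges are truncated if the image size is not a multiple of the tile size.
    """
    n_rows, n_cols = len(image), len(image[0])
    sums = []
    for r0 in range(0, n_rows, tile_rows):
        row_of_tiles = []
        for c0 in range(0, n_cols, tile_cols):
            total = sum(
                image[r][c]
                for r in range(r0, min(r0 + tile_rows, n_rows))
                for c in range(c0, min(c0 + tile_cols, n_cols))
            )
            row_of_tiles.append(total)
        sums.append(row_of_tiles)
    return sums


if __name__ == "__main__":
    # Hypothetical 4x4 images: the same peak, shifted by one pixel
    # in both dimensions in the second run.
    run_1 = [[10, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    run_2 = [[0, 0, 0, 0],
             [0, 10, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    # With 2x2 tiles, both runs give identical tile sums.
    print(tile_sums(run_1, 2, 2) == tile_sums(run_2, 2, 2))
```

The tile sums can then serve as features for a classifier such as a random forest. A fixed grid like this one does not absorb shifts that cross a tile boundary, which is a limitation of this simplified version.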