Article (Scientific journals)
Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery
Debit, Ahmed; Poulet, Christophe; Josse, Claire et al.
2025, in Briefings in Bioinformatics, 26 (4)
Peer Reviewed verified by ORBi
 

Files


Full Text
bbaf318-2.pdf
Publisher postprint (1.9 MB) Creative Commons License - Attribution
Annexes
file_s1_updated_bbaf318.pdf
(3.93 MB) Creative Commons License - Attribution
file_s2_bbaf318.xlsx
(25.75 kB) Creative Commons License - Attribution




Details



Keywords :
cancer diagnosis signature; biomarker; short signature discovery; AUC stability; Random Forest reproducibility
Abstract :
Biomarker signature discovery remains the main path to developing clinical diagnostic tools when the biological knowledge on pathology is weak. Shortest signatures are often preferred to reduce the cost of the diagnostic. The ability to find the best and shortest signature relies on the robustness of the models that can be built on such a set of molecules. The classification algorithm that will be used is often selected based on the average Area Under the Curve (AUC) performance of its models. However, it is not guaranteed that an algorithm with a large AUC distribution will keep a stable performance when facing data. Here, we propose two AUC-derived hyper-stability scores, the Hyper-stability Resampling Sensitive (HRS) and the Hyper-stability Signature Sensitive (HSS), as complementary metrics to the average AUC that should bring confidence in the choice for the best classification algorithm. To emphasize the importance of these scores, we compared 15 different Random Forest implementations. Our findings show that the Random Forest implementation should be chosen according to the data at hand and the classification question being evaluated. No Random Forest implementation can be used universally for any classification and on any dataset. Each of them should be tested for their average AUC performance and AUC-derived stability, prior to analysis.
Disciplines :
Computer science
Author, co-author :
Debit, Ahmed; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics; Institut de Biologie de l’ENS (IBENS), Ecole Normale Superieure, 46 rue d'Ulm, 75005 Paris
Poulet, Christophe; Université de Liège - ULiège > Département des sciences biomédicales et précliniques
Josse, Claire; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics
Jerusalem, Guy; Université de Liège - ULiège > Département des sciences cliniques > Oncologie
Azencott, Chloe-Agathe; Mines Paris, PSL Research University, CBIO-Centre for Computational Biology, 60 boulevard Saint-Michel, F-75006 Paris; Institut Curie, PSL Research University, 26 rue d'Ulm, F-75005 Paris; Inserm, U900, 26 rue d'Ulm, F-75005 Paris
Bours, Vincent; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics
Van Steen, Kristel; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Bioinformatique
These authors have contributed equally to this work.
Language :
English
Title :
Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery
Publication date :
July 2025
Journal title :
Briefings in Bioinformatics
ISSN :
1467-5463
eISSN :
1477-4054
Publisher :
Oxford University Press
Volume :
26
Issue :
4
Peer reviewed :
Peer Reviewed verified by ORBi
European Projects :
H2020 - 813533 - MLFPM2018 - Machine Learning Frontiers in Precision Medicine
Name of the research project :
WalInnov-NACATS
Funders :
Région wallonne
EU - European Union
Marie Skłodowska-Curie Actions
Funding number :
1610125
Funding text :
This work was supported by the Région Wallonne within the project WalInnov-NACATS [1610125 to V.B.]; and the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement [813533 to K.V.S.] (MLFPM).
Commentary :
Data availability: The pipeline used in the current study is made of two components; each was implemented and versioned separately and has its own package. The first component handles feature selection and is implemented in the R package stabFS, which can be found in the GitLab repository at https://gitlab.com/a.debit/stabfs and is available on the Figshare platform at https://doi.org/10.6084/m9.figshare.24878646.v1. The second component handles the comparison of Random Forest implementations and is implemented in the R package compareRf, which is publicly accessible at https://gitlab.uliege.be/giga-humangenetics/tools/rpackages/comparerf. The TCGA datasets used in this research can be downloaded from https://www.cancer.gov/tcga. By providing open access to the data and source code, we aim to ensure transparency, reproducibility, and ease of collaboration within the research community.
Available on ORBi :
since 14 August 2025

