Keywords :
[en] cancer diagnosis signature; biomarker; short signature discovery; AUC stability; Random Forest reproducibility
Abstract :
[en] Biomarker signature discovery remains the main path to developing clinical diagnostic tools when biological knowledge of the pathology is limited. Short signatures are often preferred because they reduce the cost of the diagnostic test. The ability to find the best and shortest signature relies on the robustness of the models that can be built on such a set of molecules. The classification algorithm is often selected based on the average Area Under the Curve (AUC) performance of its models. However, an algorithm with a high average AUC is not guaranteed to maintain stable performance when confronted with new data. Here, we propose two AUC-derived hyper-stability scores, the Hyper-stability Resampling Sensitive (HRS) and the Hyper-stability Signature Sensitive (HSS), as metrics complementary to the average AUC that bring confidence in the choice of the best classification algorithm. To demonstrate the importance of these scores, we compared 15 different Random Forest implementations. Our findings show that the Random Forest implementation should be chosen according to the data at hand and the classification question being evaluated. No Random Forest implementation can be used universally for every classification task and every dataset. Each implementation should therefore be tested for its average AUC performance and its AUC-derived stability prior to analysis.
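As a concrete illustration of the kind of check advocated above, the R sketch below trains two of the compared Random Forest implementations (randomForest and ranger, both cited in the reference list) on repeated resamplings of simulated data and summarizes each by the mean and spread of its test AUCs. The simulated data, the 70/30 resampling scheme, and the use of the AUC standard deviation as a stability proxy are illustrative assumptions only; they are not the paper's HRS and HSS definitions.

    # Hypothetical sketch: compare two Random Forest implementations by the
    # mean and spread of their test AUCs across resamplings. The spread is
    # only a stand-in for the HRS/HSS scores, which are defined in the paper.
    library(randomForest)   # Liaw & Wiener implementation
    library(ranger)         # Wright & Ziegler implementation
    library(MLmetrics)      # provides AUC()

    set.seed(42)
    n <- 200; p <- 50
    x <- as.data.frame(matrix(rnorm(n * p), n, p))
    y <- factor(rbinom(n, 1, plogis(x[[1]] + x[[2]])))  # two informative features

    resampled_auc <- function(predict_prob, n_rep = 30) {
      replicate(n_rep, {
        idx <- sample(n, size = floor(0.7 * n))          # random 70/30 split
        AUC(y_pred = predict_prob(x[idx, ], y[idx], x[-idx, ]),
            y_true = as.numeric(y[-idx]) - 1)
      })
    }

    auc_rf <- resampled_auc(function(xtr, ytr, xte) {
      predict(randomForest(xtr, ytr), xte, type = "prob")[, 2]
    })
    auc_rg <- resampled_auc(function(xtr, ytr, xte) {
      predict(ranger(x = xtr, y = ytr, probability = TRUE),
              data = xte)$predictions[, 2]
    })

    # Two implementations may rank differently on mean AUC than on AUC spread.
    rbind(randomForest = c(mean = mean(auc_rf), sd = sd(auc_rf)),
          ranger       = c(mean = mean(auc_rg), sd = sd(auc_rg)))

An implementation with the higher mean AUC but the wider AUC spread may be a riskier choice for signature discovery than a slightly less accurate but more stable one, which is the trade-off the hyper-stability scores are meant to expose.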
Disciplines :
Computer science
Author, co-author :
Debit, Ahmed ✱; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics; Institut de Biologie de l'ENS (IBENS), École Normale Supérieure, 46 rue d'Ulm, 75005 Paris
Poulet, Christophe ✱; Université de Liège - ULiège > Département des sciences biomédicales et précliniques
Josse, Claire ; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics
Jerusalem, Guy ; Université de Liège - ULiège > Département des sciences cliniques > Oncologie
Azencott, Chloé-Agathe; Mines Paris, PSL Research University, CBIO-Centre for Computational Biology, 60 boulevard Saint-Michel, F-75006 Paris; Institut Curie, PSL Research University, 26 rue d'Ulm, F-75005 Paris; Inserm, U900, 26 rue d'Ulm, F-75005 Paris
Bours, Vincent ; Université de Liège - ULiège > GIGA > GIGA Cancer - Human Genetics
Van Steen, Kristel ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Bioinformatique
✱ These authors have contributed equally to this work.
Language :
English
Title :
Assessing Random Forest self-reproducibility for optimal short biomarker signature discovery
European Projects :
H2020 - 813533 - MLFPM2018 - Machine Learning Frontiers in Precision Medicine
Name of the research project :
WalInnov-NACATS
Funders :
Région wallonne
EU - European Union
Marie Skłodowska-Curie Actions
Funding number :
1610125
Funding text :
This work was supported by the Région Wallonne within the project WalInnov-NACATS [1610125 to V.B.]; and the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement [813533 to K.V.S.] (MLFPM).
Commentary :
Data availability: The pipeline used in this study consists of two components, each implemented and versioned separately as its own R package. The first component performs the feature selection and is implemented in the R package stabFS, available in the GitLab repository at https://gitlab.com/a.debit/stabfs and on the Figshare platform at https://doi.org/10.6084/m9.figshare.24878646.v1. The second component handles the comparison and is implemented in the R package compareRf, publicly accessible at https://gitlab.uliege.be/giga-humangenetics/tools/rpackages/comparerf. The TCGA datasets used in this research can be downloaded from https://www.cancer.gov/tcga. By providing open access to the data and source code, we aim to ensure transparency, reproducibility, and ease of collaboration within the research community.
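For orientation, here is a minimal install-and-load sketch, assuming both GitLab repositories expose a standard R package layout and that the installed package names match stabFS and compareRf; the repository paths are those given above, and the commented TCGA line mirrors the usage described in the cited TCGA2STAT paper.

    # Sketch only: repository paths are taken from the statement above;
    # package layout and current availability are assumptions.
    install.packages("remotes")

    # Feature-selection component (stabFS), hosted on gitlab.com
    remotes::install_gitlab("a.debit/stabfs")

    # Comparison component (compareRf), hosted on the ULiège GitLab instance;
    # if subgroup parsing fails, clone the repository and install it locally.
    remotes::install_gitlab("giga-humangenetics/tools/rpackages/comparerf",
                            host = "gitlab.uliege.be")

    library(stabFS)
    library(compareRf)

    # TCGA datasets can be downloaded from https://www.cancer.gov/tcga; the
    # cited TCGA2STAT package offers programmatic access, where still available:
    # brca <- TCGA2STAT::getTCGA(disease = "BRCA", data.type = "RNASeq2")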
References of the abstract :
Enroth S, Berggrund M, Lycke M. et al. High throughput proteomics identifies a high-accuracy 11 plasma protein biomarker signature for ovarian cancer. Commun Biol 2019;2:1–12. https://doi.org/10.1038/s42003-019-0464-9
Tarhini A, Kudchadkar RR. Predictive and on-treatment monitoring biomarkers in advanced melanoma: moving toward personalized medicine. Cancer Treat Rev 2018;71:8–18. https://doi.org/10.1016/j.ctrv.2018.09.005
Clinical trials on cancer and biomarkers. 2019. https://clinicaltrials.gov/
Selleck MJ, Senthil M, Wall NR. Making meaningful clinical use of biomarkers. Biomark Insights 2017;12:1177271917715236. https://doi.org/10.1177/1177271917715236
Duffy MJ, Harbeck N, Nap M. et al. Clinical use of biomarkers in breast cancer: updated guidelines from the European group on tumor markers (EGTM). Eur J Cancer 2017;75:284–98. https://doi.org/10.1016/j.ejca.2017.01.017
Frères P, Wenric S, Boukerroucha M. et al. Circulating microRNA-based screening tool for breast cancer. Oncotarget 2015;7:5416–28. https://doi.org/10.18632/oncotarget.6786
Boylan KLM, Geschwind K, Koopmeiners JS. et al. A multiplex platform for the identification of ovarian cancer biomarkers. Clin Proteomics 2017;14:34. https://doi.org/10.1186/s12014-017-9169-6
Gonzalez Bosquet J, Newtson AM, Chung RK. et al. Prediction of chemo-response in serous ovarian cancer. Mol Cancer 2016;15:66. https://doi.org/10.1186/s12943-016-0548-9
Breiman L. Random forests. Mach Learn 2001;45:5–32. https://doi.org/10.1023/A:1010933404324
Liaw A, Wiener M. Classification and regression by randomForest. R News 2002;2:18–22. http://CRAN.R-project.org/doc/Rnews/
Ishwaran H, Kogalur UB. Random survival forests for R. R News 2007;7:25–31. https://doi.org/10.1214/08-AOAS169
Wright MN, Ziegler A. Ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 2017;77:1–17. https://doi.org/10.18637/jss.v077.i01
Hothorn T, Zeileis A. partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 2015;16:3905–9.
Seligman M. Rborist: extensible, parallelizable implementation of the Random Forest algorithm. 2019. https://CRAN.R-project.org/package=Rborist
Simm J, Abril IMD, Sugiyama M. Tree-based ensemble multi-task learning method for classification and regression. IEICE Trans Inf Syst 2014;E97.D:1677–81. https://doi.org/10.1587/transinf.E97.D.1677
Ciss S. Random Uniform Forests. 2015, hal-01104340, version 2.
Deng H, Runger G. Feature selection via regularized trees. In: The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, QLD, Australia, 2012, pp. 1–8. https://doi.org/10.1109/IJCNN.2012.6252640
Zhao H, Williams GJ, Huang JZ. Wsrf: an R package for classification with scalable weighted subspace random forests. J Stat Softw 2017;77:1–30. https://doi.org/10.18637/jss.v077.i03
Basu S, Kumbier K, Brown JB. et al. Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci USA 2018;115:1943–8. https://doi.org/10.1073/pnas.1711236115
da Silva N, Cook D, Lee E-K. A projection pursuit forest algorithm for supervised classification. J Comput Graph Stat 2021;30:1168–80. https://doi.org/10.1080/10618600.2020.1870480
Menze BH, Kelm BM, Splitthoff DN. et al. On oblique random forests. In: Gunopulos D, Hofmann T, Malerba D. et al. (eds.), Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, Vol. 6912, pp. 453–69. Berlin, Heidelberg: Springer, 2011. ISBN 978-3-642-23782-9.
Ballings M, Van den Poel D. rotationForest: Fit and Deploy Rotation Forest Models. 2015. https://CRAN.R-project.org/package=rotationForest
Browne J, Tomita T. Rerf: Randomer Forest. 2019. https://CRAN.R-project.org/package=rerf
Wan Y-W, Allen GI, Liu Z. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics 2016;32: 952–4. https://doi.org/10.1093/bioinformatics/btv677
Alelyani S, Liu H, Wang L. The Effect of the Characteristics of the Dataset on the Selection Stability. In Proceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence (ICTAI’11). IEEE Computer Society, USA, 2011;970–977. https://doi.org/10.1109/ICTAI.2011.167
Kuhn M. Building predictive models in R using the caret package. J Stat Softw 2008;28:1–26. https://doi.org/10.18637/jss.v028.i05
Yan Y. MLmetrics: Machine Learning Evaluation Metrics. R package version 1.1.3, 2025. https://github.com/yanyachen/MLmetrics
Pretorius A, Bierman S, Steel SJ. A meta-analysis of research in random forests for classification. In: Proceedings of the 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), pp. 1–6, 2016. https://doi.org/10.1109/RoboMech.2016.7813171
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559. https://doi.org/10.1186/1471-2105-9-559
Langfelder P, Luo R, Oldham MC. et al. Is my network module preserved and reproducible? PLoS Comput Biol 2011;7:e1001057. https://doi.org/10.1371/journal.pcbi.1001057
Lobo JM, Jiménez-Valverde A, Real R. AUC: a misleading measure of the performance of predictive distribution models. Glob Ecol Biogeogr 2008;17:145–51. https://doi.org/10.1111/j.1466-8238.2007.00358.x
Wu Y-C, Lee W-C. Alternative performance measures for prediction models. PLoS One 2014;9:e91249. https://doi.org/10.1371/journal.pone.0091249
Hanczar B, Hua J, Sima C. et al. Small-sample precision of ROC-related estimates. Bioinformatics 2010;26:822–30. https://doi.org/10.1093/bioinformatics/btq037
Padhye S, Sahu RA, Saraswat V. Introduction to Cryptography (1st ed.). CRC Press. 2018. https://doi.org/10.1201/9781315114590
Pretorius A. Advances in random forests with application to classification (Doctoral dissertation, Stellenbosch: Stellenbosch University). 2016.
Probst P, Boulesteix A-L. To tune or not to tune the number of trees in random forest. J Mach Learn Res 2018;18:1–18.
Hédou J, Marić I, Bellan G. et al. Discovery of sparse, reliable omic biomarkers with Stabl. Nat Biotechnol 2024;42:1581–93. https://doi.org/10.1038/s41587-023-02033-x
Lee YD, Cook D, Park J. et al. PPtree: projection pursuit classification tree. Electron J Statist 2013;7:1369–86. https://doi.org/10.1214/13-EJS810
Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci 2006;103:5923–8. https://doi.org/10.1073/pnas.0601231103
Chen C, Liaw A, Breiman L. Using Random Forest to Learn Imbalanced Data. University of California, Berkeley 2004; Technical Report 666. http://xtf.lib.berkeley.edu/reports/SDTRWebData/accessPages/666.html
Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015;10:e0118432. https://doi.org/10.1371/journal.pone.0118432
Weng CG, Poon J. A New Evaluation Measure for Imbalanced Datasets. In Roddick JF, Li J, Christen P, Kennedy PJ. (eds), Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. 2008. ACS. 27–32.
Floares AG, Ferisgan M, Onita D. et al. The smallest sample size for the desired diagnosis accuracy. Int J Oncol Cancer Ther 2017;2:13–19.
Han Y, Yu L. A variance reduction framework for stable feature selection. Stat Anal Data Min 2012;5:428–45.
Janitza S, Hornung R. On the overestimation of random Forest’s out-of-bag error. PLoS One 2018;13:e0201904. https://doi.org/10.1371/journal.pone.0201904
Toloşi L, Lengauer T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 2011;27:1986–94. https://doi.org/10.1093/bioinformatics/btr300
Statnikov A, Aliferis CF. Analysis and computational dissection of molecular signature multiplicity. PLoS Comput Biol 2010;6:e1000790. https://doi.org/10.1371/journal.pcbi.1000790