In an interesting and quite exhaustive review on Random Forests (RF) methodology in bioinformatics, Touw et al. address, among other topics, the problem of the detection of interactions between variables based on RF methodology. We feel that some important statistical concepts, such as 'interaction', 'conditional dependence' or 'correlation', are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of the central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.
Disciplines:
Life sciences: Multidisciplinary, general & others
Author, co-author:
Boulesteix, Anne-Laure
Janitza, Silke
Hapfelmeier, Alexander
Van Steen, Kristel ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Strobl, Carolin
Document language:
English
Title:
Letter to the Editor: On the term 'interaction' and related phrases in the literature on Random Forests.
Touw WG, Bayjanov JR, Overmars L, et al. Data mining in the life sciences with random forest: a walk in the park or lost in the jungle? Brief Bioinform 2013;14:315-26.
Kim Y, Wojciechowski R, Sung H, et al. Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects. BMC Proceedings 2009;3:S64.
Kelly C, Okada K. Variable interaction measures with random forest classifiers. 9th IEEE International Symposium on Biomedical Imaging (ISBI) 2012;154-7.
Miettinen OS. Theoretical Epidemiology: Principles of Occurrence Research in Medicine. New York: Wiley, 1985.
Grobbee DE, Hoes AW. Clinical Epidemiology: Principles, Methods, and Applications for Clinical Research. London: Jones & Bartlett Learning, 2009.
Fisher DL, Rizzo M, Caird J, et al. Handbook of Driving Simulation for Engineering, Medicine, and Psychology. Florida: CRC Press, 2011.
Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 2002;11:2463-8.
Fisher RA. XV. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edin 1918;52:399-433.
Tutz G. Regression for Categorical Data. New York: Cambridge University Press, 2012.
Winham S, Colby C, Freimuth R, et al. SNP interaction detection with random forests in high-dimensional genetic data. BMC Bioinformatics 2012;13:164.
Moore HA. A global view of epistasis. Nat Genet 2005;37:13-14.
Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. J Comput Graph Stat 2006;15:651-74.
Breiman L, Friedman JH, Olshen RA, et al. Classification and Regression Trees. New York: Chapman & Hall, 1984.
Strobl C, Boulesteix AL, Kneib T, et al. Conditional variable importance for random forests. BMC Bioinformatics 2008;9:307.
St Laurent G, Tackett MR, Nechkin S, et al. Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila. Nat Struct Mol Biol 2013;20:1333-9.
Strobl C, Boulesteix AL, Zeileis A, et al. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 2007;8:25.
Nicodemus KK. Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures. Brief Bioinform 2011;12:369-73.
Boulesteix AL, Bender A, Lorenzo Bermejo J, et al. Random forest Gini importance favours SNPs with large minor allele frequency: assessment, sources and recommendations. Brief Bioinform 2012;13:292-304.
Azen R, Budescu CV, Reiser B. Criticality of predictors in multiple regression. Brit J Math Stat Psy 2001;54:201-25.
Breiman L, Cutler A. Random Forests-original implementation. http://www.stat.berkeley.edu/~breiman/RandomForests/, 2004 (24 March 2014, date last accessed).
Liaw A, Wiener M. Breiman and Cutler's Random Forests for Classification and Regression, 2012. URL http://CRAN.R-project.org/. R package version 4.6-7. (24 March 2014, date last accessed).
Strobl C, Malley JD, Tutz G. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods 2009;14:323-48.
Boulesteix AL, Janitza S, Kruppa J, et al. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscipl Rev Data Mining Knowl Discov 2012;2:493-507.
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. 2nd edn. New York: Springer, 2001.
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006;7:3.
Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008;9:319.
Grömping U. Variable importance assessment in regression: linear regression versus random forest. Am Stat 2009;63:308-19.
Chen X, Ishwaran H. Pathway hunting by random survival forests. Bioinformatics 2013;29:99-105.
McKinney BA, Crowe JE Jr, Guo J, et al. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genet 2009;5:e1000432.
Kim Y, Kim H. Application of random forests to association studies using mitochondrial single nucleotide polymorphisms. Genom Inform 2007;5:168-73.
Sakoparnig T, Kockmann T, Paro R, et al. Binding profiles of chromatin-modifying proteins are predictive for transcriptional activity and promoter-proximal pausing. J Comput Biol 2012;19:126-38.