Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.

Disciplines :

Engineering, computing & technology: Multidisciplinary, general & others

Author, co-author :

Botta, Vincent ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Louppe, Gilles ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Algorith. des syst. en interaction avec le monde physique

Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Language :

English

Publication date :

02 April 2014

Journal title :

PLoS ONE

eISSN :

1932-6203

Publisher :

Public Library of Science, San Francisco, United States - California

Peer reviewed :

Peer Reviewed verified by ORBi

Available on ORBi :

since 23 April 2014

Statistics

Number of views

385 (25 by ULiège)

Number of downloads

338 (8 by ULiège)

