[en] Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset, containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC), was re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An analysis of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. In contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results, with a maximum AUC of 0.80. GBT methods such as XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
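As an illustration of the AUC statistic used above as the main comparison score, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `auc_from_scores` and the toy labels and scores are hypothetical. It assumes labels coded 1 for cases and 0 for controls, and computes the AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half.

```python
def auc_from_scores(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: list of 0/1 values (1 = case, 0 = control); hypothetical coding.
    scores: list of predicted risk scores, same length as labels.
    """
    n = len(scores)
    # Sort sample indices by increasing score.
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Tied samples share the average of their 1-based ranks.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")

    # U statistic: rank sum of the cases minus its minimum possible value.
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example: two cases and two controls with made-up scores.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.8, 0.6, 0.4, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)
```

In the toy example, three of the four case/control pairs are ranked correctly, so the sketch returns 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.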
Disciplines :
Life sciences: Multidisciplinary, general & others
Author, co-author :
Romagnoni, Alberto
Jegou, Simon
Van Steen, Kristel ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Wainrib, Gilles
Hugot, Jean-Pierre
Language :
English
Title :
Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data.