A walk into random forests: adaptation and application to Genome-Wide Association Studies

Botta, Vincent

Doctoral thesis (Dissertations and theses)

2013

Permalink
https://hdl.handle.net/2268/157690

Files

Full Text

botta-thesis.pdf

Author postprint (15.78 MB)

All documents in ORBi are protected by a user license.

Details

Keywords :

GWAS; genetic; genomic; random forest; machine learning; decision tree

Abstract :

[en] Understanding the underlying mechanisms of common diseases is one of the major goals of current medical research. As most of these disorders are linked to genetic factors, identifying the associated variants is an excellent strategy for elucidating molecular and cellular dysfunctions, and could ultimately lead to better personalised diagnostics and treatments. Genome-Wide Association Studies (GWAS) aim to discover variants spread over the genome that could lead, in isolation or in combination, to a particular trait or to an adverse phenotype such as a disease. The basic idea behind these studies is to statistically analyse the genetic differences between groups of healthy (controls) and diseased (cases) individuals. Advances in genetic marker technology allow for dense genotyping of hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) per individual. This makes it possible to characterise representative samples composed of several hundred to several thousand cases and controls, each described by up to a million genetic markers sampling the genomic variation among these individuals. The standard approach to genome-wide association studies is based on univariate hypothesis tests. In this approach, each genetic marker is analysed in isolation from the others in order to assess its potential association with the studied phenotype, in practice by computing so-called p-values based on statistical assumptions about the data-generation mechanism. Because the number of genotyped SNPs is very large relative to the limited number of individuals, multiple-testing corrections need to be applied when carrying out these analyses, leading to reduced statistical power. While this standard approach has underpinned the discovery of many novel loci for several complex diseases in recent years, it has several intrinsic limitations.
A first limitation is that this approach does not directly account for correlations among the explanatory variables. A second intrinsic limitation is that univariate tests cannot account for genetic interactions, i.e. causal effects that are only observed when specific combinations of mutations and/or non-mutations are present at the same time. A third limitation is that univariate approaches do not directly allow the genetic risk to be assessed, since many of the identified markers (with similarly small p-values) actually account for the same underlying causal factor: exploiting their information to predict the genetic risk is hence far from straightforward. Within bioinformatics, machine learning has become one of the major potential sources of progress. Conversely, biology has become one of the main drivers of research in machine learning, which is itself already a very competitive research field. Among the subfields of machine learning, supervised learning and its extensions, such as semi-supervised learning, stand out as the most mature and at the same time the most rapidly evolving area of research. Within this context, the purpose of this thesis was to study the application of random forest methods to genome-wide association studies, with the twofold goal of (i) inferring predictive models able to assess disease risk and (ii) identifying causal mutations explaining the phenotype. The choice of this family of methods was originally motivated by properties that make them a priori well suited to this kind of analysis. They can deal efficiently with very large amounts of data without relying on strong assumptions about the underlying mechanisms linking genetic and environmental factors to phenotypes, and they can also provide interpretable information, in the form of scores and/or rankings of SNPs, to help identify causal genetic loci.
In the first part of this manuscript, we analyse the state of the art in the application field of genome-wide association studies and in supervised machine learning, and subsequently describe in detail the three tree-based ensemble methods that we have implemented and applied in our research. In Part II, we report our empirical investigations in three successive steps, namely (i) a preliminary study on simulated datasets, yielding controlled conditions with known ground truth and allowing for a first sanity check of the T-Trees methods in ideal conditions; (ii) a detailed study on a real-life dataset concerning Crohn's disease, where we try to understand the main features of the three different algorithms in terms of predictive accuracy and ability to identify relevant genetic information, as well as their sensitivity to various kinds of quality control procedures and algorithmic parameters; (iii) a systematic replication study, where we confirm, on 7 different datasets from the Wellcome Trust Case Control Consortium, the main outcomes of our study on Crohn's disease, while using default parameter settings.
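
The univariate testing and multiple-testing correction mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the allelic chi-square test, the Bonferroni correction, the SNP names, the allele counts and the figure of 500,000 tests are all assumptions chosen for illustration. It shows how a marker can be nominally significant yet fail the corrected threshold, which is the loss of power the abstract refers to.

```python
import math


def allelic_chi2_p(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of a 1-df chi-square test on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
snps = {
    "snp_strong": (1400, 2600, 1000, 3000),
    "snp_moderate": (1200, 2800, 1000, 3000),
    "snp_null": (1010, 2990, 1000, 3000),
}

n_tests = 500_000  # assumed number of genotyped SNPs
threshold = bonferroni_threshold(0.05, n_tests)

p_values = {name: allelic_chi2_p(*counts) for name, counts in snps.items()}
for name, p in p_values.items():
    print(f"{name}: p={p:.3g}, nominal={p < 0.05}, corrected={p < threshold}")
```

With these assumed counts, the moderate marker passes the nominal 0.05 level but not the corrected one, so it would be missed by a genome-wide scan even though each test is still performed on one marker in isolation.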

Disciplines :

Computer science

Author, co-author :

Botta, Vincent ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Language :

English

Title :

A walk into random forests: adaptation and application to Genome-Wide Association Studies

Defense date :

November 2013

Institution :

ULiège - Université de Liège

Degree :

Docteur en sciences, orientation informatique

Promotor :

Wehenkel, Louis ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

President :

Geurts, Pierre ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

Jury member :

Georges, Michel ; Université de Liège - ULiège > GIGA > GIGA Molecular Biomimetic and Protein Engineering Laboratory

Druet, Tom ; Université de Liège - ULiège > GIGA > GIGA Medical Genomics - Unit of Animal Genomics

Sinoquet, Christine

Saeys, Yvan

Van Steen, Kristel ; Université de Liège - ULiège > GIGA > GIGA Medical Genomics - Biostatistics, biomedicine and bioinformatics

Available on ORBi :

since 30 October 2013

Statistics

Number of views

685 (51 by ULiège)

Number of downloads

1170 (27 by ULiège)