Machine learning-based feature ranking: Statistical interpretation and gene network inference

Huynh-Thu, Vân Anh

Doctoral thesis (Dissertations and theses)

2012

Permalink
https://hdl.handle.net/2268/108611

Files

Full Text

these_vananh_hyperref.pdf

Author postprint (9.97 MB)

Details

Keywords :

bioinformatics; machine learning; systems biology

Abstract :

[en] Machine learning techniques, and supervised learning methods in particular, are now widely used in bioinformatics. This thesis targets two prominent applications: biomarker discovery and regulatory network inference. Both problems are commonly addressed with feature ranking methods, which order the input features of a supervised learning problem from the most to the least relevant for predicting the output. This thesis presents, on the one hand, methodological contributions on machine learning-based feature ranking techniques and, on the other hand, more applied contributions on gene regulatory network inference.

Our methodological contributions focus on the problem of selecting truly relevant features from machine learning-based feature rankings. Unlike the p-values returned by univariate tests, the relevance scores that machine learning techniques use to rank features are usually not statistically interpretable. This lack of interpretability makes it very difficult to identify the truly relevant features among the top-ranked ones, and hence prevents the wide adoption of these methods by practitioners. Our first contribution is a procedure, based on permutation tests, that estimates for each subset of top-ranked features the probability that the subset contains at least one irrelevant feature (called CER, for "conditional error rate"). As a second contribution, we performed a large-scale evaluation of several procedures, existing or novel and including our CER method, that all replace the original relevance scores with measures that can be interpreted statistically. These procedures, which were assessed on several artificial and real datasets, differ greatly in computing time and in the tradeoff they achieve between false positives and false negatives. Our experiments also clearly show that using model performance as a criterion for feature selection is often counterproductive.

The problem of gene regulatory network inference can be formulated as several feature selection problems, each one aiming to discover the regulators of one target gene. Within this family of methods, we developed the GENIE3 algorithm, which exploits feature rankings derived from tree-based ensemble methods to infer gene networks from steady-state gene expression data. In a second step, we derived two extensions of GENIE3 that infer regulatory networks from other types of data. The first extension exploits expression data from time course experiments, while the second handles genetical genomics datasets, which contain expression data together with information about genetic markers. GENIE3 was the best performer in the DREAM4 In Silico Multifactorial challenge in 2009 and in the DREAM5 Network Inference challenge in 2010, and its extensions perform very well compared to other methods on several artificial datasets.
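
The decomposition described in the abstract, one feature ranking problem per target gene, can be illustrated with a minimal sketch. This is not the GENIE3 implementation from the thesis: as an assumption made for brevity, it replaces the tree-based ensemble with a toy ensemble of randomized one-split trees (stumps) written with the Python standard library only, and it scales each target to unit variance so that scores are comparable across targets. All names (`rank_regulators`, `infer_network`, `n_stumps`, `seed`) are hypothetical.

```python
import random
from statistics import fmean


def variance(values):
    """Population variance of a list; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = fmean(values)
    return fmean((v - mean) ** 2 for v in values)


def best_split_gain(x, y):
    """Largest variance reduction of y obtained by thresholding x once."""
    n = len(y)
    total = variance(y)
    pairs = sorted(zip(x, y))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        within = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, total - within)
    return best


def rank_regulators(expr, target, n_stumps=100, seed=0):
    """Score every other gene as a candidate regulator of `target`.

    Toy stand-in for a tree ensemble (an assumption of this sketch): each
    stump is grown on a bootstrap sample, looks at a random subset of
    candidate genes, and credits its variance reduction to the gene it picks.
    """
    rng = random.Random(seed)
    candidates = [g for g in expr if g != target]
    scores = dict.fromkeys(candidates, 0.0)
    target_var = variance(expr[target])
    if target_var == 0.0:
        return scores  # a constant target carries no signal
    scale = target_var ** 0.5
    n = len(expr[target])
    k = max(1, round(len(candidates) ** 0.5))
    for _ in range(n_stumps):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        y = [expr[target][r] / scale for r in rows]  # unit-variance target
        best_gene, best_gain = None, 0.0
        for gene in rng.sample(candidates, k):
            gain = best_split_gain([expr[gene][r] for r in rows], y)
            if gain > best_gain:
                best_gene, best_gain = gene, gain
        if best_gene is not None:
            scores[best_gene] += best_gain / n_stumps
    return scores


def infer_network(expr, n_stumps=100, seed=0):
    """Solve one ranking problem per target gene and pool all scored edges."""
    edges = []
    for target in expr:
        ranking = rank_regulators(expr, target, n_stumps, seed)
        edges.extend((reg, target, w) for reg, w in ranking.items())
    return sorted(edges, key=lambda edge: edge[2], reverse=True)
```

Here `expr` maps each gene name to its list of expression values across samples, and `infer_network(expr)` returns `(regulator, target, score)` triples sorted by decreasing score. The output is a ranking of putative edges, not a thresholded network; choosing a cutoff on such scores is the interpretability problem that the methodological part of the abstract addresses.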

Disciplines :

Computer science
Biochemistry, biophysics & molecular biology

Author, co-author :

Huynh-Thu, Vân Anh ; Université de Liège - ULiège > GIGA-Management : Coordination ALMA-GRID

Language :

English

Title :

Machine learning-based feature ranking: Statistical interpretation and gene network inference

Defense date :

09 January 2012

Number of pages :

160 + 30

Institution :

ULiège - Université de Liège

Degree :

Doctorat en Sciences de l'Ingénieur (Electricité et Electronique)

Promotor :

Wehenkel, Louis ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

Geurts, Pierre ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

President :

Sepulchre, Rodolphe ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science

Jury member :

Van Steen, Kristel ; Université de Liège - ULiège > GIGA > GIGA Medical Genomics - Biostatistics, biomedicine and bioinformatics

Saeys, Yvan

Ambroise, Christophe

Di Bernardo, Diego

Küffner, Robert

Available on ORBi :

since 18 January 2012

Statistics

Number of views

1586 (110 by ULiège)

Number of downloads

1746 (57 by ULiège)