Doctoral thesis (Dissertations and theses)
Machine learning-based feature ranking: Statistical interpretation and gene network inference
Huynh-Thu, Vân Anh
2012
 

Files


Full Text
these_vananh_hyperref.pdf
Author postprint (9.97 MB)



Details



Keywords :
bioinformatics; machine learning; systems biology
Abstract :
Machine learning techniques, and in particular supervised learning methods, are nowadays widely used in bioinformatics. Two prominent applications that we target specifically in this thesis are biomarker discovery and regulatory network inference. These two problems are commonly addressed through the use of feature ranking methods that order the input features of a supervised learning problem from the most to the least relevant for predicting the output. This thesis presents, on the one hand, methodological contributions around machine learning-based feature ranking techniques and, on the other hand, more applied contributions on gene regulatory network inference. Our methodological contributions focus on the problem of selecting truly relevant features from machine learning-based feature rankings. Unlike the p-values returned by univariate tests, the relevance scores that machine learning techniques use to rank features are usually not statistically interpretable. This lack of interpretability makes it very difficult to identify the truly relevant features among the top-ranked ones, and hence prevents the wide adoption of these methods by practitioners. Our first contribution in this field is a procedure, based on permutation tests, that estimates for each subset of top-ranked features the probability that the subset contains at least one irrelevant feature (called CER, for "conditional error rate"). As a second contribution, we performed a large-scale evaluation of several procedures, both existing and novel and including our CER method, that all replace the original relevance scores with measures that can be interpreted in a statistical way. These procedures, which were assessed on several artificial and real datasets, differ greatly in their computing times and in the trade-off they achieve between false positives and false negatives.
Our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive. The problem of gene regulatory network inference can be formulated as several feature selection problems, each one aiming at discovering the regulators of one target gene. Within this family of methods, we developed the GENIE3 algorithm, which exploits feature rankings derived from tree-based ensemble methods to infer gene networks from steady-state gene expression data. In a second step, we derived two extensions of GENIE3 that aim to infer regulatory networks from other types of data. The first extension exploits expression data provided by time course experiments, while the second extension addresses genetical genomics datasets, which contain expression data together with information about genetic markers. GENIE3 was the best performer in the DREAM4 In Silico Multifactorial challenge in 2009 and in the DREAM5 Network Inference challenge in 2010, and its extensions perform very well compared to other methods on several artificial datasets.
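The decomposition described in the abstract (one feature-ranking problem per target gene, solved with a tree-based ensemble) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `rank_links` and `_grow`, the small bagged randomized trees, the variance-reduction importance, the unit-variance scaling of each target, and the synthetic three-gene dataset are all assumptions chosen to keep the example short and dependency-free.

```python
import random

def _var(values):
    """Population variance of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def _grow(X, y, rows, candidates, k, rng, importance, depth):
    """Grow one randomized regression tree, keeping only feature importances."""
    if depth == 0 or len(rows) < 6:
        return
    total = len(rows) * _var([y[r] for r in rows])
    best = None
    for f in rng.sample(candidates, k):  # k randomly chosen candidate regulators
        values = sorted({X[r][f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for r in rows if X[r][f] <= t]
            right = [r for r in rows if X[r][f] > t]
            if not left or not right:
                continue
            # Variance reduction achieved by this split, weighted by node size.
            gain = (total - len(left) * _var([y[r] for r in left])
                    - len(right) * _var([y[r] for r in right]))
            if best is None or gain > best[0]:
                best = (gain, f, left, right)
    if best is None or best[0] <= 0:
        return
    gain, f, left, right = best
    importance[f] += gain
    _grow(X, y, left, candidates, k, rng, importance, depth - 1)
    _grow(X, y, right, candidates, k, rng, importance, depth - 1)

def rank_links(expr, n_trees=30, max_depth=4, seed=0):
    """Return (score, regulator, target) tuples, highest score first.

    expr is a list of samples; each sample lists one expression value per gene.
    """
    rng = random.Random(seed)
    n_samples, n_genes = len(expr), len(expr[0])
    links = []
    for target in range(n_genes):  # one feature-ranking problem per target gene
        column = [sample[target] for sample in expr]
        std = _var(column) ** 0.5
        if std == 0:
            continue  # a constant gene cannot be predicted
        # Assumption: scale the target to unit variance so that scores are
        # comparable across targets.
        y = [v / std for v in column]
        regulators = [g for g in range(n_genes) if g != target]
        k = max(1, int(len(regulators) ** 0.5))
        importance = {g: 0.0 for g in regulators}
        for _ in range(n_trees):
            rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # bootstrap
            _grow(expr, y, rows, regulators, k, rng, importance, max_depth)
        for g in regulators:
            links.append((importance[g] / (n_trees * n_samples), g, target))
    return sorted(links, reverse=True)

# Hypothetical toy data: gene 1 follows gene 0, gene 2 is independent noise.
data_rng = random.Random(1)
expr = []
for _ in range(60):
    g0 = data_rng.gauss(0, 1)
    expr.append([g0, 2 * g0 + data_rng.gauss(0, 0.1), data_rng.gauss(0, 1)])

links = rank_links(expr)
for score, regulator, target in links:
    print(f"gene {regulator} -> gene {target}: {score:.3f}")
```

On the toy data, the link between genes 0 and 1 should rank well above any link involving gene 2. The scores are only a ranking: as the abstract stresses, they are not statistically interpretable without a further procedure, and choosing a cut-off on the ranked list is a separate problem.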
Disciplines :
Computer science
Biochemistry, biophysics & molecular biology
Author, co-author :
Huynh-Thu, Vân Anh  ;  Université de Liège - ULiège > GIGA-Management : Coordination ALMA-GRID
Language :
English
Title :
Machine learning-based feature ranking: Statistical interpretation and gene network inference
Defense date :
09 January 2012
Number of pages :
160 + 30
Institution :
ULiège - Université de Liège
Degree :
Doctorat en Sciences de l'Ingénieur (Electricité et Electronique)
Promotor :
Wehenkel, Louis  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Geurts, Pierre  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
President :
Sepulchre, Rodolphe ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Jury member :
Van Steen, Kristel  ;  Université de Liège - ULiège > GIGA > GIGA Medical Genomics - Biostatistics, biomedicine and bioinformatics
Saeys, Yvan
Ambroise, Christophe
Di Bernardo, Diego
Küffner, Robert
Available on ORBi :
since 18 January 2012

Statistics


Number of views
1379 (106 by ULiège)
Number of downloads
1573 (54 by ULiège)
