Article (Scientific journals)
Statistical interpretation of machine learning-based feature importance scores for biomarker discovery
Huynh-Thu, Vân Anh; Saeys, Yvan; Wehenkel, Louis et al.
2012, in Bioinformatics, 28 (13), p. 1766-1774
Peer Reviewed verified by ORBi
 

Files


Full Text
Bioinformatics-2012-Huynh-Thu-1766-74.pdf
Publisher postprint (497.61 kB)



Details



Abstract :
Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpreted by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. Because biological processes are expected to involve complex interactions between variables, univariate methods may miss some informative biomarkers. Variable relevance scores provided by machine learning techniques can potentially highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners.
Results: We evaluated several existing and novel procedures that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted statistically, such as p-values, false discovery rates or family-wise error rates, for which it is easier to choose a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in computing time and in the trade-off they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.
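Illustration (not taken from the article): the abstract does not say which procedures were evaluated, so the following is only a minimal sketch of one generic way to turn relevance scores into p-values and false discovery rates. It assumes a label-permutation null distribution followed by the Benjamini-Hochberg adjustment. The function names (`mean_diff_scores`, `permutation_p_values`, `benjamini_hochberg`) and the toy data are hypothetical, and the mean-difference score is merely a stand-in for whatever importance score a machine learning model would supply.

```python
import random


def mean_diff_scores(X, y):
    """Stand-in relevance score (assumption): |mean(class 1) - mean(class 0)| per feature.

    Any function with the same signature could be plugged in instead,
    for example one returning a model's feature importances.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def permutation_p_values(X, y, score_fn, n_perm=200, seed=0):
    """Empirical p-value per feature: how often a label-shuffled score
    reaches the observed score (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    exceed = [0] * len(observed)
    labels = list(y)
    for _ in range(n_perm):
        rng.shuffle(labels)  # breaks any feature/label association
        null_scores = score_fn(X, labels)
        for j, (s_null, s_obs) in enumerate(zip(null_scores, observed)):
            if s_null >= s_obs:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, p_values[j] * m / rank)
        adjusted[j] = running_min
    return adjusted


# Toy data (assumption): 40 samples, 5 features, only feature 0 depends on the label.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label * 3.0 + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(4)]
     for label in y]

p_values = permutation_p_values(X, y, mean_diff_scores)
q_values = benjamini_hochberg(p_values)
print(p_values)
print(q_values)
```

Features whose adjusted value falls below a chosen level (0.05, say) would then be retained, which gives the ranking the kind of significance threshold the abstract says raw relevance scores lack.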
Disciplines :
Computer science
Biochemistry, biophysics & molecular biology
Author, co-author :
Huynh-Thu, Vân Anh  ;  Université de Liège - ULiège > GIGA-Management : Coordination ALMA-GRID
Saeys, Yvan
Wehenkel, Louis  ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Geurts, Pierre  ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Language :
English
Title :
Statistical interpretation of machine learning-based feature importance scores for biomarker discovery
Publication date :
25 April 2012
Journal title :
Bioinformatics
ISSN :
1367-4803
eISSN :
1367-4811
Publisher :
Oxford University Press - Journals Department, Oxford, United Kingdom
Volume :
28
Issue :
13
Pages :
1766-1774
Peer reviewed :
Peer Reviewed verified by ORBi
Available on ORBi :
since 25 June 2012

Statistics


Number of views
465 (66 by ULiège)
Number of downloads
484 (22 by ULiège)

Scopus citations®
86
Scopus citations® without self-citations
83
OpenCitations
63
OpenAlex citations
104

Mendeley (226)
CiteULike (5)
Smart Citations (scite)
Citing publications
103
Supporting
2
Mentioning
88
Contrasting
0
