Supervised learning with decision tree-based methods in computational and systems biology

At the intersection between artificial intelligence and statistics, supervised learning provides algorithms to automatically build predictive models only from observations of a system. During the last twenty years, supervised learning has been a tool of choice to analyze the ever-growing and increasingly complex data generated in the context of molecular biology, with successful applications in genome annotation, function prediction, and biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as nonparametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the paper is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part of the paper provides a survey of their applications in the context of computational and systems biology. The supplementary material provides information about various non-standard extensions of the decision tree-based approach to modeling, some practical guidelines for the choice of parameters and algorithm variants depending on the practical objectives of their application, pointers to freely accessible software packages, and a brief primer going through the different manipulations needed to use the tree-induction packages available in the R statistical tool.

Disciplines :

Computer science

Author, co-author :

Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Irrthum, Alexandre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Language :

English

Publication date :

December 2009

Journal title :

Molecular BioSystems

ISSN :

1742-206X

eISSN :

1742-2051

Publisher :

Royal Society of Chemistry (RSC), United Kingdom

Volume :

Issue :

Pages :

1593-1605

Peer reviewed :

Peer Reviewed verified by ORBi

Available on ORBi :

since 15 October 2009

Statistics

Number of views

360 (36 by ULiège)

Number of downloads

17 (8 by ULiège)

Scopus citations®

195

Scopus citations® without self-citations

192

OpenCitations

125

OpenAlex citations

233
