Abstract:
[en] At the intersection of artificial intelligence and statistics, supervised learning provides
algorithms that automatically build predictive models solely from observations of a system. Over the
last twenty years, supervised learning has been a tool of choice for analyzing the ever-growing and
increasingly complex data generated in the context of molecular biology, with successful applications
in genome annotation, function prediction, or biomarker discovery. Among supervised learning
methods, decision tree-based methods stand out as nonparametric methods that have the unique
feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent
accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this
class of methods. The first part of the paper is devoted to an intuitive but complete description of
decision tree-based methods and a discussion of their strengths and limitations with respect to other
supervised learning methods. The second part of the paper provides a survey of their applications
in the context of computational and systems biology.
The supplementary material provides information about various non-standard extensions of the
decision tree-based approach to modeling, some practical guidelines for the choice of parameters
and algorithm variants depending on the practical objectives of their application, pointers to freely
accessible software packages, and a brief primer walking through the steps needed
to use the tree-induction packages available in the R statistical environment.