Abstract:
Because of the rapid progress of computer and information technology, large
amounts of data are nowadays available in many domains. Automatic learning
aims at developing algorithms able to produce synthetic high-level information,
or models, from such data. Learning algorithms are generally evaluated
according to three criteria: interpretability (how well the model helps to
understand the data), predictive accuracy (how well the model predicts unseen
situations), and computational efficiency (how fast the algorithm is and how
well it scales to large databases). This thesis explores two issues in
automatic learning: the improvement of the well-known decision tree induction
method and the problem of learning classification models for time series data.
Decision tree induction is an automatic learning method which focuses on
modeling input/output relationships. While this algorithm is among the fastest
and most interpretable methods, its accuracy is not always competitive with
that of other algorithms. It is commonly accepted that this suboptimality is
due to the excessive variance of the method. We first carry out an empirical
study which shows quantitatively how important this variance is, i.e. how
strongly decision trees depend on the random nature of the database used to
infer them. These experiments confirm that this variance is detrimental not
only from the point of view of accuracy but also from the point of view of
interpretability.
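As a minimal illustration of such a variance measurement (an illustrative
protocol using scikit-learn and synthetic data, not the exact setup of the
thesis):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Fixed test set; trees are grown on independently resampled
    # learning sets so that their predictions can be compared.
    X, y = make_classification(n_samples=2000, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=500, random_state=0)

    rng = np.random.default_rng(0)
    preds = []
    for _ in range(50):
        idx = rng.choice(len(X_pool), size=300, replace=True)
        tree = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx])
        preds.append(tree.predict(X_test))

    preds = np.array(preds)
    # Fraction of test points on which individual trees disagree with
    # their own majority vote: a simple proxy for prediction variance.
    majority = (preds.mean(axis=0) > 0.5).astype(int)
    print("disagreement rate:", (preds != majority).mean())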
With the goal of improving both interpretability and accuracy, we consider
three variance reduction techniques for decision trees. First, aiming mainly
at interpretability, we propose several methods which try to stabilize the
parameters chosen during tree induction. While these methods succeed in
reducing the variability of the parameters, they yield only a slight
improvement in accuracy. Second, we consider perturb and combine algorithms
(e.g. bagging, boosting), which combine the predictions of several models
obtained by randomizing the learning process in some way.
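For concreteness, a textbook sketch of bagging, the simplest of these
algorithms (using scikit-learn for illustration; this is not the
implementation studied in the thesis):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Bagging: grow each tree on a bootstrap replicate of the learning
    # set and aggregate the predictions of the resulting ensemble.
    bagging = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,   # number of randomized models to combine
        bootstrap=True,     # resample the learning set for each tree
        random_state=0,
    )
    # bagging.fit(X_train, y_train); bagging.predict(X_test)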
Inspired by the high variance of the parameters defining a decision tree, we
propose an extremely randomized decision tree induction algorithm, called
extra-tree, which chooses all parameters at random during induction. The
aggregation of several of these extra-trees gives an important reduction of
variance, and this algorithm compares favorably with both bagging and boosting
in terms of accuracy and computational efficiency. Thanks to the randomization
of the parameters, the resulting method is also competitive with classical
decision tree induction in terms of computational efficiency.
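The core of the extra-tree idea, drawing the split attribute and cut-point at
random instead of optimizing them, can be sketched as follows (a minimal
illustration assuming numeric attributes and a simple stopping rule; the
thesis algorithm is more complete):

    import random

    def random_split(X):
        # Choose a test "x[a] < t" with attribute a and cut-point t
        # both drawn at random, without evaluating any split score.
        a = random.randrange(len(X[0]))
        values = [row[a] for row in X]
        t = random.uniform(min(values), max(values))
        return a, t

    def build_extra_tree(X, y, min_size=5):
        # X: list of attribute vectors, y: list of class labels.
        # Stop on small or pure nodes and predict the majority class.
        if len(y) <= min_size or len(set(y)) == 1:
            return max(set(y), key=y.count)
        a, t = random_split(X)
        left = [i for i, row in enumerate(X) if row[a] < t]
        right = [i for i, row in enumerate(X) if row[a] >= t]
        if not left or not right:
            return max(set(y), key=y.count)
        return (a, t,
                build_extra_tree([X[i] for i in left],
                                 [y[i] for i in left], min_size),
                build_extra_tree([X[i] for i in right],
                                 [y[i] for i in right], min_size))

An extra-tree ensemble then aggregates the class votes of many such trees.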
In addition to these two approaches, we propose a ``dual'' perturb and combine
algorithm which delays the perturbation to the prediction stage and hence
requires only one model. In combination with decision trees, this method
bridges the gap between single decision trees and perturb and combine
algorithms: from the former it retains interpretability (by using only one
model), while with the latter it shares some of the accuracy improvement (by
reducing variance).
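A minimal sketch of this dual scheme (additive Gaussian noise on the
attributes is an assumed perturbation here; only the aggregation at prediction
time is essential):

    import numpy as np

    def dual_pc_predict(tree, x, n_perturb=100, noise=0.1, seed=0):
        # Perturb the same input point several times and let a single
        # fitted tree vote on each perturbed copy.
        rng = np.random.default_rng(seed)
        X_pert = x + rng.normal(scale=noise, size=(n_perturb, x.shape[0]))
        votes = tree.predict(X_pert)
        # Majority vote over the perturbed predictions.
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]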
The second topic of the thesis is the problem of time series classification.
The most direct way to solve this problem is to apply existing learning
algorithms to low-level variables corresponding to the values of the time
series at several time points. Experiments with the tree-based algorithms
studied in the first part of the thesis show that this approach is limited. We
then propose a variance reduction technique specific to this kind of data,
which consists in aggregating the predictions given by a classification model
over subsequences of the time series.
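A minimal sketch of this subsequence-aggregation scheme (the window length and
majority voting are illustrative choices):

    import numpy as np

    def classify_series(clf, series, window=20):
        # Slide a window over the series and collect all contiguous
        # subsequences of the given length.
        subs = np.array([series[i:i + window]
                         for i in range(len(series) - window + 1)])
        # Let a classifier trained on fixed-length subsequences vote on
        # each window, then aggregate the votes.
        votes = clf.predict(subs)
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]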
Since this method does not provide interpretable models, we propose a second
method which extends decision tree tests by allowing them to detect local
shift-invariant properties, or patterns, in time series.
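Such a test can be sketched as follows (Euclidean distance and a fixed
threshold are illustrative assumptions):

    import numpy as np

    def pattern_test(series, pattern, threshold):
        # The test fires if some subsequence of the series is close
        # enough to the stored pattern, wherever that subsequence
        # occurs, making the test invariant to shifts in time.
        m = len(pattern)
        dists = [np.linalg.norm(series[i:i + m] - pattern)
                 for i in range(len(series) - m + 1)]
        return min(dists) < threshold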
The study proposed in this part of the thesis is only a first step in the
domain, but our conclusions suggest several directions for future work on
handling complex types of data with automatic learning methods.