Abstract:
Over the last decade, the field of neuroscience has seen the emergence of
machine learning methods for the analysis of neuroimaging data. Unlike univariate
methods, which consider voxels one at a time, these techniques analyse relationships
among several voxels and are able to detect multivariate patterns.
In the context of neurodegenerative diseases, such as Alzheimer’s disease (AD),
they can be used to design a diagnostic system and to find the patterns in
neuroimages that are responsible for the disease. The work presented here
thus falls within the field of pattern recognition in neuroimaging. Our objective is to
explore the possibilities that tree ensemble methods, such as Random Forests,
offer in this domain in general, and in AD research in particular.
These methods are well suited to the needs of this domain, as they combine very
good predictive performance with interpretable results in the form of
variable importance scores. Our contributions include both methodological developments
around tree ensemble methods and applications of these methods
to real datasets.
The methodological part of the thesis focuses on the analysis and improvement
of Random Forest variable importance scores for neuroimaging problems.
Typical datasets in this domain are of very high dimensionality (hundreds of
thousands of voxels) and contain comparatively very few samples (tens or hundreds
of patients). Our first contribution is a theoretical and empirical analysis
of how importance scores behave in such extreme settings, depending on
the method parameters. We then propose several improvements to importance
scores in such settings that take advantage of either the spatial structure among
the features or a pre-defined partitioning of these features into groups.
Finally, we address an issue with Random Forest importance scores, namely finding
a threshold that separates truly relevant variables from irrelevant ones. For this purpose,
we adapt several statistical methods proposed in the bioinformatics literature.
These methods are extended to compute a statistical score for groups of features
instead of individual features. This group-level adaptation is motivated by our
expectation that groups of voxels, rather than isolated voxels, explain a disease.
We show that working at the group level leads to higher
statistical power than working at the feature level. The approach is applied to a
real dataset for the prognosis of AD, where it is shown to highlight brain regions
that are consistent with results in the literature.
In the second part of the thesis, we present several applications of Random
Forests to AD research. First, we use tree-based ensemble methods
to clinically characterize two different metabolic profiles observed in PET scans
of AD patients. Second, we carry out an empirical comparison showing that
Random Forests are competitive with linear methods, in terms of accuracy and
interpretability, on several real datasets related to three research questions
about AD: the diagnosis of patients with dementia, the prognosis of patients with mild
cognitive impairment (MCI), and the differentiation of MCI and AD patients.