Abstract:
[en] Experimentally determining the three-dimensional structure of a protein is a slow and expensive process. Nowadays, supervised machine learning techniques are widely used to predict protein structures, and in particular to predict surrogate annotations, which are much less complex than 3D structures.
This dissertation presents, on the one hand, methodological contributions for learning multiple tasks simultaneously and for selecting relevant feature representations and, on the other hand, biological contributions arising from the application of these techniques to several protein annotation problems.
Our first methodological contribution introduces a multi-task formulation for learning various protein structural annotation tasks. Unlike the traditional methods proposed in the bioinformatics literature, which mostly treat these tasks independently, our framework exploits the natural idea that multiple related prediction tasks should be addressed simultaneously. Our empirical experiments on a set of five sequence labeling tasks clearly highlight the benefit of our multi-task approach over single-task approaches in terms of correctly predicted labels.
Our second methodological contribution focuses on the best way to identify a minimal subset of feature functions, i.e., functions that encode properties of complex objects, such as sequences or graphs, into forms appropriate for learning algorithms (typically, vectors of features). Our empirical experiments on disulfide connectivity pattern prediction and disordered region prediction show that using carefully selected feature functions combined with ensembles of extremely randomized trees leads to very accurate models.
Our biological contributions stem mainly from applying our feature function selection algorithm to the problems of predicting disulfide connectivity patterns and of predicting disordered regions. In both cases, our approach identified a relevant representation of the data that should play a role in the prediction of disulfide bonds (respectively, disordered regions) and, consequently, in protein structure-function relationships. In particular, the major biological contribution of our method is the discovery of a novel feature function which, to the best of our knowledge, has never been highlighted in the context of predicting disordered regions. These representations were carefully assessed against several baselines, such as the 10th Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition.