Scientific congresses and symposiums : Poster
Engineering, computing & technology : Computer science
Discussing the validation of high-dimensional probability distribution learning with mixtures of graphical models for inference
Schnitzler, François [Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation]
Validation in Statistics and Machine Learning
from 06-10-2010 to 07-10-2010
Nicole Krämer
Anne-Laure Boulesteix
[en] Bayesian networks ; Markov trees ; Chow-Liu algorithm ; mixture of trees
[en] Exact inference on probabilistic graphical models quickly becomes intractable as the dimension of the problem increases. A weighted average (or mixture) of different simple graphical models can be used instead of a more complicated model to learn a distribution, making probabilistic inference much more efficient. I hope to discuss issues related to the validation of algorithms for learning such mixtures of models, and to high-dimensional learning of probabilistic graphical models in general, and to gather valuable feedback and comments on my approach. The main problems are the difficulty of assessing the accuracy of the algorithms and of choosing a representative set of target distributions.
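
As an illustration of the weighted average mentioned above, a mixture of simple models can be written as follows. The notation is an assumption made here for illustration, not taken from the poster: m is the number of simple models, P_{T_i} is the distribution encoded by the i-th simple model, and \lambda_i is its weight.

```latex
\hat{P}(x) = \sum_{i=1}^{m} \lambda_i \, P_{T_i}(x),
\qquad \lambda_i \ge 0,
\qquad \sum_{i=1}^{m} \lambda_i = 1
```

Under this form, any marginal probability of the mixture is the same weighted average of the corresponding marginals of the components, so it can be computed with m runs of inference on simple models.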

The accuracy of algorithms for learning probabilistic graphical models is often evaluated by comparing the structure of the resulting model to that of the target (e.g. number of shared or differing edges, BDe score, etc.). This approach, however, falls short when studying methods that use a mixture of simple models: individually, these lack the representational power to model the true distribution, and only their combination allows them to compete with more sophisticated models. The Kullback-Leibler divergence is a measure of the difference between two probability densities, and can be used to compare any model learned from a dataset to the data-generating distribution. For computational reasons, however, I had to resort to a Monte Carlo estimate of this quantity for large problems (starting at around 200 variables).
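
A minimal sketch of such a Monte Carlo estimate, assuming the divergence is taken from the target P to the learned model Q and estimated as the average of log P(x) - log Q(x) over samples drawn from P. The toy four-state distributions and the function names are hypothetical and chosen only so that the estimate can be compared to the exact value; this is not the code used for the poster.

```python
import math
import random


def kl_exact(p, q):
    """Exact D_KL(P || Q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_monte_carlo(sample_p, log_p, log_q, n_samples, rng):
    """Estimate D_KL(P || Q) as the mean of log P(x) - log Q(x) over x ~ P."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_p(rng)
        total += log_p(x) - log_q(x)
    return total / n_samples


# Toy target P and learned model Q over 4 states (illustrative numbers only).
p = [0.4, 0.3, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]

rng = random.Random(0)
estimate = kl_monte_carlo(
    sample_p=lambda r: r.choices(range(len(p)), weights=p)[0],
    log_p=lambda x: math.log(p[x]),
    log_q=lambda x: math.log(q[x]),
    n_samples=20000,
    rng=rng,
)
exact = kl_exact(p, q)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

For a graphical model over many variables, the exact sum over all configurations is not feasible, but sampling from the target and evaluating both log-probabilities at each sample remains possible, which is what the estimator relies on.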

Since probabilistic inference, and not probability modelling, is the ultimate motivation for building these models, a more meaningful measure of accuracy could be obtained by comparing mixtures against a combination of state-of-the-art model learning and approximate inference algorithms. However, the exact inference result cannot easily be obtained for interesting target distributions, since mixtures are considered precisely because exact inference is not possible on those targets, and using approximate inference as the reference would introduce a bias.

Selecting the target distributions used to generate the data sets on which the algorithms are evaluated also proved a challenge. The easiest solution was to generate them at random (although different approaches can be designed). Such models are, however, likely to be rather different from real problems, and thus constitute a poor choice for assessing the practical interest of mixtures of models. Methods (e.g. linking multiple copies of a given network) have been developed to increase the size of models known by the community (e.g. the alarm network), and the resulting graphical models have been made available. These could, however, still be far from the kind of interactions present in a real setting. A better way to proceed could be to generate samples based on the equations describing a physical problem, to learn a probabilistic model as well as possible from this high-dimensional dataset, and to use it as the target distribution.
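
The abstract does not say how the random targets were generated. One possible scheme, shown here purely as an assumption, is to draw a random tree-structured model over binary variables with random conditional probability tables and to sample data sets from it by ancestral sampling. All names and sizes below are hypothetical.

```python
import random


def random_markov_tree(n_vars, rng):
    """Draw a random tree over binary variables with random conditional tables."""
    # parent[i] is chosen uniformly among earlier variables; variable 0 is the root.
    parent = [None] + [rng.randrange(i) for i in range(1, n_vars)]
    # cpt[i][a] = P(X_i = 1 | parent value a); the root uses only index 0.
    cpt = [[rng.random(), rng.random()] for _ in range(n_vars)]
    return parent, cpt


def sample(parent, cpt, rng):
    """Ancestral sampling: parents always precede children in index order."""
    x = []
    for i, pa in enumerate(parent):
        a = 0 if pa is None else x[pa]
        x.append(1 if rng.random() < cpt[i][a] else 0)
    return x


rng = random.Random(1)
parent, cpt = random_markov_tree(n_vars=10, rng=rng)
dataset = [sample(parent, cpt, rng) for _ in range(500)]
```

A tree is used only to keep the sketch short. A target intended to challenge mixtures of simple models would need a richer structure, which is exactly the difficulty the paragraph above raises.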
Systèmes et Modélisation
Fonds pour la formation à la Recherche dans l'Industrie et dans l'Agriculture (Communauté française de Belgique) - FRIA ; Biomagnet IUAP network of the Belgian Science Policy Office ; Pascal2 network of excellence of the EC

