We consider two representations of the input data for genome-wide association studies using random forests: raw genotypes, described by a few thousand to a few hundred thousand discrete variables, each describing a single nucleotide polymorphism; and haplotype block contents, represented by combinations of about 10 to 100 adjacent, correlated genotypes. We adapt random forests to exploit haplotype blocks and compare this approach with the use of raw genotypes, in terms of predictive power and localization of causal mutations, on simulated datasets with one or two interacting effects.
Disciplines :
Engineering, computing & technology: Multidisciplinary, general & others
Author, co-author :
Botta, Vincent ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Hansoul, Sarah ; Université de Liège - ULiège > Département de productions animales > GIGA-R : Génomique animale
Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Language :
English
Title :
Raw genotypes vs haplotype blocks for genome wide association studies by random forests
Publication date :
September 2008
Audience :
International
Main work title :
Proc. of MLSB 2008, second workshop on Machine Learning in Systems Biology