LD; haplotype; principal component analysis; PCA; ipPCA
Abstract :
[en] Objective
To identify and differentiate subpopulations using a rich set of genetic markers. Reduced marker sets make this task challenging, especially when the populations come from similar geographic regions or when spurious patterns are likely to exist.
Method
Single nucleotide polymorphisms (SNPs) are commonly used to capture variation between populations, and genome-wide SNP data are often pruned based on linkage disequilibrium (LD) patterns. Haplotype composition and the pattern of LD between markers may vary between larger populations, but may also be informative within more confined geographic regions. Indeed, knowledge of haplotypes in unrelated individuals can reveal useful information about genetic ancestry. Here, we use iterative pruning principal component analysis (ipPCA) [1] to identify and characterize subpopulations in an unsupervised way. The input data are either pruned genome-wide SNP data (PLINK 1.9 with the "--indep-pairwise" option, window size = 100k, r² < 0.25) or multilocus haplotype information derived from the genome-wide SNP panel (haplotypes inferred with BEAGLE 3.3.2). Both approaches are applied to real-life data from 992 Thai individuals [2].
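The two ingredients of the SNP-based route, LD pruning followed by iterative PCA splitting, can be illustrated with a minimal NumPy sketch. It is not PLINK's pruning algorithm and not the ipPCA implementation of [1]. The function names (`ld_prune`, `pc1`, `iterative_split`), the window counted in SNPs, the bisection on the sign of the first principal component, and the explained-variance stopping rule (`share_min`) are all assumptions made for illustration. `G` is assumed to be an individuals x SNPs matrix of 0/1/2 allele counts.

```python
import numpy as np


def ld_prune(G, window=50, r2_max=0.25):
    """Greedy pairwise LD pruning (simplified stand-in, not PLINK's method).

    A SNP is kept only if its squared correlation with every already-kept
    SNP among the previous `window` columns is below `r2_max`.
    Returns the list of kept column indices.
    """
    kept = []
    for j in range(G.shape[1]):
        if G[:, j].std() == 0:  # monomorphic SNPs carry no information
            continue
        recent = [k for k in kept if j - k <= window]
        r2 = [np.corrcoef(G[:, j], G[:, k])[0, 1] ** 2 for k in recent]
        if all(v < r2_max for v in r2):
            kept.append(j)
    return kept


def pc1(G):
    """Scores on the first principal component and its share of variance."""
    X = G - G.mean(axis=0)  # centre each SNP (no scaling, for simplicity)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    total = np.sum(s ** 2)
    share = s[0] ** 2 / total if total > 0 else 0.0
    return U[:, 0] * s[0], share


def iterative_split(G, ids=None, min_size=10, share_min=0.3):
    """Recursively bisect individuals on the sign of their PC1 score.

    Splitting stops when a group is too small or when PC1 explains less
    than `share_min` of the variance (a hypothetical stopping rule).
    Returns a list of index arrays, one per putative subpopulation.
    """
    if ids is None:
        ids = np.arange(G.shape[0])
    if len(ids) < 2 * min_size:
        return [ids]
    scores, share = pc1(G[ids])
    left, right = ids[scores < 0], ids[scores >= 0]
    if share < share_min or min(len(left), len(right)) < min_size:
        return [ids]
    return (iterative_split(G, left, min_size, share_min)
            + iterative_split(G, right, min_size, share_min))
```

Under these assumptions, the pruned matrix `G[:, ld_prune(G)]` would be passed to `iterative_split`, and each returned index array is one candidate stratum. PCA is recomputed within every subgroup, which is what lets finer structure emerge after the dominant axis of variation has been split off.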
Result
Preliminary results indicate that ipPCA applied to pruned SNP data and ipPCA that explicitly uses multilocus information (haplotypes) give complementary information about population substructure in geographically confined populations such as the Thai samples in this study. The two methods address different aspects of population structure. Detailed simulation studies are needed to identify the scenarios in which haplotype-based ipPCA performs best.
Conclusion
In this work, we propose combining an LD-based haplotype encoding scheme with the ipPCA machinery to retrieve fine population substructure. Despite the complexities associated with haplotype inference, exploiting the LD structure between SNPs can add value in the search for relevant population strata.
References
1. Intarapanich, A., et al., Iterative pruning PCA improves resolution of highly structured populations. BMC Bioinformatics, 2009. 10: p. 382.
2. Wangkumhang, P., et al., Insight into the peopling of Mainland Southeast Asia from Thai population genetic structure. PLoS One, 2013. 8(11): p. e79522.
Research Center/Unit :
Systems and Modeling Unit
Disciplines :
Life sciences: Multidisciplinary, general & others
Author, co-author :
Chaichoompu, Kridsadakorn ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Fouladi, Ramouna ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Wangkumhang, Pongsakorn ; National Center for Genetic Engineering and Biotechnology, Thailand > Genome Institute > Biostatistics and informatics Laboratory
Wilantho, Alisa ; National Center for Genetic Engineering and Biotechnology, Thailand > Genome Institute > Biostatistics and informatics Laboratory
Chareanchim, Wanwisa ; National Center for Genetic Engineering and Biotechnology, Thailand > Genome Institute > Biostatistics and informatics Laboratory
Tongsima, Sissades ; National Center for Genetic Engineering and Biotechnology, Thailand > Genome Institute > Biostatistics and informatics Laboratory
Sakuntabhai, Anavaj ; Institut Pasteur, France > Functional Genetics of Infectious Diseases Unit
Van Steen, Kristel ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique
Language :
English
Title :
LD-based haplotype encoding scheme with iterative pruning principal component analysis (ipPCA) to retrieve population substructures