Iterative pruning PCA improves resolution of highly structured populations.

Algorithms; Animals; Computational Biology/methods; Genetic Variation; Genetics, Population; Humans; Models, Genetic; Population/genetics; Principal Component Analysis/methods

Abstract :

BACKGROUND: Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. Populations are therefore structured into genetically distinct subpopulations. As genotypic datasets become ever larger, it is increasingly difficult to correctly estimate the number of subpopulations and assign individuals to them. Computationally efficient non-parametric methods, chiefly those based on Principal Components Analysis (PCA), are thus increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, their accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework which addresses this shortcoming.

RESULTS: A novel population structure analysis algorithm called iterative pruning PCA (ipPCA) was developed which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structures, the subpopulation assignments of individuals made by ipPCA were largely consistent with those of the STRUCTURE, BAPS and AWclust algorithms. On the other hand, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by the other methods.

CONCLUSION: The algorithm is computationally efficient and not constrained by the dataset complexity. This systematic subpopulation assignment approach removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogeneous population.
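
The idea of iterative pruning can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering step or the termination test, so the 2-means split and the eigenvalue-ratio stopping rule below (and the names `top_pcs`, `two_means`, `ip_pca`, `ratio`, `min_size`) are assumptions chosen for illustration only. The sketch runs PCA on a group of individuals, splits the group in two if substructure appears to be present, and repeats on each half until no further structure is detected.

```python
import numpy as np

def top_pcs(G, k=2):
    """Return top-k PC scores and all eigenvalues for a genotype
    matrix G (individuals x SNPs, coded 0/1/2)."""
    X = G - G.mean(axis=0)                      # center each SNP
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    eigvals = S ** 2 / max(len(G) - 1, 1)
    return U[:, :k] * S[:k], eigvals

def two_means(P, iters=50):
    """Plain 2-means on PC scores, seeded with the two individuals
    at the extremes of PC1 so the result is deterministic."""
    centers = np.stack([P[P[:, 0].argmin()], P[P[:, 0].argmax()]])
    labels = np.zeros(len(P), dtype=int)
    for _ in range(iters):
        d = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.stack([P[labels == j].mean(axis=0) for j in (0, 1)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def ip_pca(G, idx=None, ratio=3.0, min_size=10):
    """Recursively split individuals; returns a list of index arrays,
    one per inferred subpopulation."""
    if idx is None:
        idx = np.arange(len(G))
    if len(idx) < 2 * min_size:
        return [idx]
    scores, ev = top_pcs(G[idx])
    ev = ev[ev > 1e-9 * ev[0]]                  # drop numerically zero eigenvalues
    # Hypothetical stopping rule (an assumption of this sketch): treat the
    # group as homogeneous unless the leading eigenvalue clearly dominates.
    if ev[0] < ratio * np.median(ev):
        return [idx]
    labels = two_means(scores)
    a, b = idx[labels == 0], idx[labels == 1]
    if min(len(a), len(b)) < min_size:
        return [idx]
    # Prune: analyze each half on its own, with PCA recomputed from scratch.
    return ip_pca(G, a, ratio, min_size) + ip_pca(G, b, ratio, min_size)
```

Recomputing PCA within each half is what allows closely related subpopulations to separate once a genetically distant group has been pruned away; in a single global PCA the leading components would be dominated by the distant group. The number of returned index arrays serves as the estimate of the number of subpopulations.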

Disciplines :

Life sciences: Multidisciplinary, general & others

Author, co-author :

Intarapanich, Apichart

Shaw, Philip J.

Assawamakin, Anunchai

Wangkumhang, Pongsakorn

Ngamphiw, Chumpol

Chaichoompu, Kridsadakorn ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Bioinformatique

Piriyapongsa, Jittima

Tongsima, Sissades

Language :

English

Title :

Iterative pruning PCA improves resolution of highly structured populations.

Publication date :

2009

Journal title :

BMC Bioinformatics

eISSN :

1471-2105

Publisher :

BioMed Central, United Kingdom

Volume :

Pages :

382

Peer reviewed :

Peer Reviewed verified by ORBi

Available on ORBi :

since 31 January 2014

Bibliography

Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet 2003, 361(9357):598-604. 10.1016/S0140-6736(03)12520-2, 12598158.
Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics 2000, 155(2):945-959. 1461096, 10835412.
International HapMap Consortium. A haplotype map of the human genome. Nature 2005, 437(7063):1299-1320. 10.1038/nature04226, 1880871, 16255080.
Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 2003, 164(4):1567-1587. 1462648, 12930761.
Purcell S, Sham P. Properties of structured association approaches to detecting population stratification. Human heredity 2004, 58(2):93-107. 10.1159/000083030, 15711089.
Wu B, Liu N, Zhao H. PSMIX: an R package for population structure inference via maximum likelihood method. BMC bioinformatics 2006, 7:317. 10.1186/1471-2105-7-317, 1550430, 16792813.
Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: analytical and study design considerations. Genetic epidemiology 2005, 28(4):289-301. 10.1002/gepi.20064, 15712363.
Corander J, Marttinen P. Bayesian identification of admixture events using multilocus molecular markers. Molecular ecology 2006, 15(10):2833-2843.
Corander J, Marttinen P, Siren J, Tang J. Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC bioinformatics 2008, 9:539. 10.1186/1471-2105-9-539, 2629778, 19087322.
Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: A new computer program and a comparison study. Molecular Ecology Notes 2007, 7(5):747-756.
Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD. Measuring European population stratification with microarray genotype data. American journal of human genetics 2007, 80(5):948-956. 10.1086/513477, 1852743, 17436249.
Reeves PA, Richards CM. Accurate Inference of Subtle Population Structure (and Other Genetic Discontinuities) Using Principal Coordinates. PLoS ONE 2009, 4(1). 10.1371/journal.pone.0004269, 2625398, 19172174.
Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS genetics 2006, 2(12):e190. 10.1371/journal.pgen.0020190, 1713260, 17194218.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics 2006, 38(8):904-909. 10.1038/ng1847, 16862161.
Han J, Kraft P, Nan H, Guo Q, Chen C, Qureshi A, Hankinson SE, Hu FB, Duffy DL, Zhao ZZ. A genome-wide association study identifies novel alleles associated with hair color and skin pigmentation. PLoS genetics 2008, 4(5):e1000074. 10.1371/journal.pgen.1000074, 2367449, 18483556.
Liu Y, Helms C, Liao W, Zaba LC, Duan S, Gardner J, Wise C, Miner A, Malloy MJ, Pullinger CR. A genome-wide association study of psoriasis and psoriatic arthritis identifies new disease loci. PLoS genetics 2008, 4(3):e1000041. 10.1371/journal.pgen.1000041, 2274885, 18369459.
Stokowski RP, Pant PV, Dadd T, Fereday A, Hinds DA, Jarman C, Filsell W, Ginger RS, Green MR, Ouderaa FJ. A genomewide association study of skin pigmentation in a South Asian population. American journal of human genetics 2007, 81(6):1119-1132. 10.1086/522235, 2276347, 17999355.
Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. SIGKDD Explorations 2004, 6(1):15.
Gao X, Starmer JD. AWclust: point-and-click software for non-parametric population structure analysis. BMC bioinformatics 2008, 9:77. 10.1186/1471-2105-9-77, 2253519, 18237431.
Lee C, Abdool A, Huang CH. PCA-based population structure inference with generic clustering algorithms. BMC bioinformatics 2009, 10(Suppl 1):S73. 10.1186/1471-2105-10-S1-S73, 2648762, 19208178.
Liu N, Zhao H. A non-parametric approach to population structure inference using multilocus genotypes. Human genomics 2006, 2(6):353-364.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic Subspace Clustering of High Dimensional Data for data mining applications. SIGMOD Record ACM Special Interest Group on Management of Data 1998, 27(2):94-105.
Golub GH, Van Loan CF. Matrix Computations. 3rd edition. 1996, Baltimore: The Johns Hopkins University Press.
Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK. Analysis and application of European genetic substructure using 300 K SNP information. PLoS genetics 2008, 4(1):e4. 10.1371/journal.pgen.0040004, 2211544, 18208329.
Luca D, Ringquist S, Klei L, Lee AB, Gieger C, Wichmann HE, Schreiber S, Krawczak M, Lu Y, Styche A. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. American journal of human genetics 2008, 82(2):453-463. 10.1016/j.ajhg.2007.11.003, 2427172, 18252225.
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistical Society: Series B 2001, 63:411-423.
Bezdek JC. Pattern Recognition with Fuzzy Objective Function Algorithms. 1981, New York: Plenum Press.
Download Structure 2.2. http://pritch.bsd.uchicago.edu/software/structure2_2.html
Installing BAPS to XP/Windows 2000 systems. http://web.abo.fi/fak/mnf/mate/jc/software/baps_xp.html
AWclust. http://awclust.sourceforge.net/
Liang L, Zollner S, Abecasis GR. GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics (Oxford, England) 2007, 23(12):1565-1567. 10.1093/bioinformatics/btm138, 17459963.
Ewens WJ. Mathematical Population Genetics. 1979, Berlin: Springer.
International HapMap Project. http://hapmap.org
FTP site for downloading bovine SNPs. ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/snp/Btau20040927
Bovine Genome Project. http://www.hgsc.bcm.tmc.edu/projects/bovine/index.html
Shriver MD, Mei R, Parra EJ, Sonpar V, Halder I, Tishkoff SA, Schurr TG, Zhadanov SI, Osipova LP, Brutsaert TD. Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Human genomics 2005, 2(2):81-89.
Breeds of Livestock, Cattle: (Bos). http://www.ansi.okstate.edu/breeds/cattle/
Reich D, Price AL, Patterson N. Principal component analysis of genetic data. Nature genetics 2008, 40(5):491-492. 10.1038/ng0508-491, 18443580.
Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS genetics 2007, 3(9):1672-1686. 10.1371/journal.pgen.0030160, 1988848, 17892327.
Waples RS, Gaggiotti O. What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular ecology 2006, 15(6):1419-1439. 10.1111/j.1365-294X.2006.02890.x, 16629801.
Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL. Worldwide human relationships inferred from genome-wide patterns of variation. Science (New York, NY) 2008, 319(5866):1100-1104.
Gan G, Ma C, Wu J. Data Clustering: Theory, Algorithms, and Applications. 2007, Philadelphia: SIAM (Society for Industrial and Applied Mathematics).
Tang H, Choudhry S, Mei R, Morgan M, Rodriguez-Cintron W, Burchard EG, Risch NJ. Recent genetic selection in the ancestral admixture of Puerto Ricans. American journal of human genetics 2007, 81(3):626-633. 10.1086/520769, 1950843, 17701908.