Abstract :
[en] During the past decades, population structure analysis has played an important role in stratifying populations and tracing population ancestries. Population structure arises mainly from non-random mating between subgroups in a population, for social, cultural, or geographical reasons. Genetic structure in populations may also arise from known or unknown family relationships. Complex disease analyses, case-control genetic association studies in particular, can be affected by so-called cryptic relatedness, which refers to unobserved ancestral relationships between study individuals. Because population structure may confound results from genetic association studies and from studies that aim to detect clinically relevant substructure in patients, its detection is highly relevant. Notably, removing unwanted population structure when detecting molecular-based patient subtypes is likely to leave subtle or fine-scale residual structure.
In this thesis, we developed a novel genetic structure detection tool, hereafter referred to as IPCAPS, which can also be used as, or extended to, a tool for fine-scale reclassification of patients. IPCAPS uses the fixation index (FST) to measure the distance between clusters and to decide when to terminate its iterative loop. An FST > 0.001 is typically seen as evidence for genetic differentiation between European populations. We also introduced a novel heuristic called EigenFit as one of the stopping criteria. Although our tool has been developed to easily accommodate multiple data types, we have illustrated the concept of IPCAPS and its performance on simulated and real-life data using panels of genome-wide SNP data. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among people; there are roughly 10 million of them.
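To make the FST-based stopping idea concrete, the following is a minimal sketch and not the IPCAPS implementation. It assumes genotypes coded as 0/1/2 alternate-allele counts, uses a simple heterozygosity-based FST estimate (ratio of sums over SNPs), and borrows the 0.001 figure quoted above as a default threshold. The function names `allele_freq`, `fst_two_clusters`, and `keep_split` are hypothetical.

```python
def allele_freq(genotypes):
    """Alternate-allele frequency per SNP.

    genotypes: list of rows (one per individual), each a list of 0/1/2 counts.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]


def fst_two_clusters(g1, g2):
    """Heterozygosity-based FST between two clusters: sum(H_T - H_S) / sum(H_T).

    H_S is the size-weighted mean expected heterozygosity within clusters,
    H_T the expected heterozygosity of the pooled sample.
    """
    p1, p2 = allele_freq(g1), allele_freq(g2)
    n1, n2 = len(g1), len(g2)
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        p_bar = (n1 * a + n2 * b) / (n1 + n2)
        h_t = 2 * p_bar * (1 - p_bar)
        h_s = (n1 * 2 * a * (1 - a) + n2 * 2 * b * (1 - b)) / (n1 + n2)
        num += h_t - h_s
        den += h_t
    # With no polymorphic SNP the clusters cannot be distinguished.
    return num / den if den > 0 else 0.0


def keep_split(g1, g2, threshold=0.001):
    """Assumed stopping rule: keep a split only if FST exceeds the threshold."""
    return fst_two_clusters(g1, g2) > threshold
```

Under these assumptions, an iterative procedure would stop subdividing a cluster once `keep_split` returns `False` for the two candidate sub-clusters. Other FST estimators (for example, ones that correct for sample size) could be substituted without changing the overall logic.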
We evaluated the performance of IPCAPS in a variety of simulation scenarios, including varying sample sizes, varying SNP panel sizes, the absence or presence of outliers, and large or very small genetic separation between synthetic populations. Performance was measured by accuracy and computation time. We observed that our method generally outperformed a selection of other iterative-pruning-based methods such as ipPCA, iNJclust, and SHIPS. Even in the presence of outliers, the computation time of IPCAPS is driven largely by sample size rather than by the number of SNPs included in the analysis.
We furthermore validated our tools and proposed protocols on a variety of real-life datasets. These datasets differed in complexity and ranged from worldwide sample collections, through regional populations, to geographically confined samples. In particular, we analyzed data from the International HapMap Project, the 1000 Genomes Project, Africa, and Thailand. We proposed a suitable protocol to correct for population stratification and to perform patient subgrouping in samples from the International IBD Genetics Consortium (IBD referring to inflammatory bowel disease). All developed analysis protocols included guidelines for the interpretation of identified strata.
In conclusion, IPCAPS is a promising structure detection tool. It identified previously unreported fine structure in African and HapMap populations. IPCAPS analysis also suggested the presence of at least 3 subtypes among Crohn’s disease patients and at least 3 subtypes among ulcerative colitis patients. More work is needed to evaluate the importance of these findings in clinical practice and for precision medicine.