Population stratification can cause spurious associations in a genome-wide association study (GWAS). It occurs when differences in the allele frequencies of single nucleotide polymorphisms (SNPs) between cases and controls are due to ancestral differences rather than to the trait of interest. Principal components analysis (PCA) is the established approach for detecting population substructure from genome-wide data and for adjusting genetic associations for stratification by including the top principal components (PCs) as covariates in the analysis. An alternative is genetic matching of cases and controls; this, however, requires well-defined population strata so that cases and controls can be selected appropriately.
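To make the PCA adjustment concrete, the following is a minimal sketch and not the analysis used in this work. It assumes a simulated samples-by-SNPs genotype matrix coded 0/1/2, and it uses an ordinary least-squares fit as a simple stand-in for the regression model of a real GWAS. The function names `top_principal_components` and `snp_effect`, and all simulation parameters, are hypothetical.

```python
import numpy as np


def top_principal_components(genotypes, n_components=2):
    """Top PC scores (samples x n_components) of a samples x SNPs matrix."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for monomorphic SNPs
    u, s, _ = np.linalg.svd(centered / scale, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def snp_effect(phenotype, snp, covariates=None):
    """Least-squares slope of phenotype on one SNP, optionally adjusted."""
    columns = [np.ones(len(snp)), snp]
    if covariates is not None:
        columns.append(covariates)
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return beta[1]


# Hypothetical simulation: two populations whose allele frequencies differ.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 1000, 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
freq_a[0], freq_b[0] = 0.2, 0.8  # SNP 0 is strongly differentiated

population = np.repeat([0, 1], n_per_pop)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
])
# Case status depends on population only, never on any SNP.
phenotype = rng.binomial(1, np.where(population == 0, 0.2, 0.8)).astype(float)

pcs = top_principal_components(genotypes, n_components=2)
unadjusted = snp_effect(phenotype, genotypes[:, 0])
adjusted = snp_effect(phenotype, genotypes[:, 0], covariates=pcs)
```

In this simulation SNP 0 has no causal effect, so any association it shows without adjustment comes from stratification alone. Including the top PCs as covariates should pull the estimated effect toward zero.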
We developed a novel algorithm that clusters individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA. We demonstrate the effectiveness of the algorithm in real and simulated data, and show that matching cases and controls using the cluster assignments substantially reduces population stratification bias. Through simulation, we show that in certain situations the power of our method is higher than that of adjustment for principal components.
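The general idea of clustering on PCs and then matching within clusters can be sketched as follows. The clustering algorithm itself is not specified here, so this sketch substitutes plain k-means (Lloyd's algorithm) as a stand-in; it is not the novel algorithm. It also assumes a simple 1:1 frequency matching within each cluster. The names `nearest_center`, `kmeans`, and `match_within_clusters`, and the simulated PC scores, are hypothetical.

```python
import numpy as np


def nearest_center(points, centers):
    """Index of the closest center for each point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        # Recompute each center; keep the old one if a cluster is empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return nearest_center(points, centers)


def match_within_clusters(labels, is_case, seed=0):
    """Indices of a subset with equal cases and controls in every cluster."""
    labels = np.asarray(labels)
    is_case = np.asarray(is_case, dtype=bool)
    rng = np.random.default_rng(seed)
    keep = []
    for cluster in np.unique(labels):
        cases = np.flatnonzero((labels == cluster) & is_case)
        controls = np.flatnonzero((labels == cluster) & ~is_case)
        n = min(len(cases), len(controls))
        if n == 0:
            continue  # a cluster without both groups contributes nobody
        keep.extend(rng.choice(cases, size=n, replace=False))
        keep.extend(rng.choice(controls, size=n, replace=False))
    return np.sort(np.array(keep, dtype=int))


# Hypothetical PC scores: two ancestry groups separated along PC1,
# with different case proportions in each group.
rng = np.random.default_rng(1)
population = np.repeat([0, 1], 300)
pcs = rng.normal(0.0, 1.0, size=(600, 2))
pcs[:, 0] += np.where(population == 0, -5.0, 5.0)
is_case = rng.random(600) < np.where(population == 0, 0.7, 0.3)

clusters = kmeans(pcs, k=2)
matched = match_within_clusters(clusters, is_case)
```

After matching, every cluster contributes the same number of cases and controls, so case status is no longer associated with cluster membership in the retained subset. Random subsampling is only one possible matching rule; others, such as nearest-neighbour matching within clusters, would fit the same structure.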
In addition to reducing population stratification bias and improving power, matching creates a clean dataset free of population stratification, which can then be used to build prediction models without including variables to adjust for ancestry. The cluster assignments also allow genetic heterogeneity to be estimated by examining cluster-specific effects.
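As an illustration of what examining cluster-specific effects could look like, the sketch below computes one log odds ratio per cluster from a 2x2 table of case status against carrier status. It assumes a binary carrier indicator for a single SNP and adds 0.5 to each cell (the Haldane-Anscombe correction) to avoid division by zero. The function name `cluster_log_odds_ratios` and the toy data are hypothetical, and this is not necessarily the estimator used in this work.

```python
import numpy as np


def cluster_log_odds_ratios(clusters, is_case, is_carrier):
    """Log odds ratio of carrier status vs. case status within each cluster."""
    clusters = np.asarray(clusters)
    is_case = np.asarray(is_case, dtype=bool)
    is_carrier = np.asarray(is_carrier, dtype=bool)
    result = {}
    for cluster in np.unique(clusters):
        member = clusters == cluster
        # 2x2 table counts, each with 0.5 added (Haldane-Anscombe correction).
        a = np.sum(member & is_case & is_carrier) + 0.5
        b = np.sum(member & is_case & ~is_carrier) + 0.5
        c = np.sum(member & ~is_case & is_carrier) + 0.5
        d = np.sum(member & ~is_case & ~is_carrier) + 0.5
        result[int(cluster)] = float(np.log((a * d) / (b * c)))
    return result


# Toy data: the allele is enriched in cases in cluster 0
# and enriched in controls in cluster 1.
clusters = np.array([0] * 8 + [1] * 8)
is_case = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 2, dtype=bool)
is_carrier = np.array(
    [1, 1, 1, 0, 1, 0, 0, 0,   # cluster 0
     1, 0, 0, 0, 1, 1, 1, 0],  # cluster 1
    dtype=bool,
)
effects = cluster_log_odds_ratios(clusters, is_case, is_carrier)
```

Effects that differ in size or sign across clusters, as in this toy example, are the kind of pattern that would suggest genetic heterogeneity. A real analysis would also need standard errors and a formal test of heterogeneity.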