Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most imputations have used HapMap samples as the reference, and the imputation of low-frequency and rare variants (minor allele frequency (MAF) < 5%) has not been systematically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are now available to facilitate imputation of these variants. To estimate how well low-frequency and rare variants can be imputed, we used IMPUTE version 2 to impute 153 individuals, each genotyped on 3 different arrays (317K, 610K and 1 million SNPs), against three reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1). These three releases of the 1000 Genomes data differ in sample size, ancestry diversity, number of variants and frequency spectrum of those variants. We found that both the reference panel and the GWAS chip density affect the imputation of low-frequency and rare variants. 1KGphase1 outperformed the other 2 panels, with a higher concordance rate, a higher proportion of well-imputed variants (info>0.4) and a higher mean info score in each MAF bin. Similarly, the 1M chip array outperformed the 610K and 317K arrays. However, for very rare variants (MAF≤0.3%), only 0–1% of the variants were well imputed. We conclude that the imputation of low-frequency and rare variants improves with larger reference panels and denser genome-wide genotyping arrays. Yet even with a large reference panel and a dense genotyping array, very rare variants remain difficult to impute.
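The three quality measures named above (concordance rate, proportion of well-imputed variants with info>0.4, and mean info score per MAF bin) can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the input layout (a list of `(maf, info)` pairs and genotypes coded as allele counts 0/1/2 with `None` for missing) and the intermediate 1% bin edge are assumptions made for illustration. Only the 0.3% and 5% MAF cut-offs and the 0.4 info threshold come from the text.

```python
from bisect import bisect_left

# Upper edges of the MAF bins, as fractions. The 0.3% and 5% cut-offs are
# taken from the text; the 1% edge is an assumed intermediate boundary.
MAF_EDGES = [0.003, 0.01, 0.05]
BIN_LABELS = ["MAF<=0.3%", "0.3%<MAF<=1%", "1%<MAF<=5%", "MAF>5%"]

# A variant counts as well imputed when its info score exceeds this value.
INFO_THRESHOLD = 0.4


def maf_bin(maf):
    """Return the label of the MAF bin (upper edge inclusive) containing maf."""
    return BIN_LABELS[bisect_left(MAF_EDGES, maf)]


def summarise_by_maf_bin(variants):
    """Summarise imputation quality per MAF bin.

    variants: iterable of (maf, info) pairs, one per imputed variant.
    Returns {bin label: (n_variants, proportion_well_imputed, mean_info)};
    bins that contain no variants are omitted.
    """
    grouped = {}
    for maf, info in variants:
        grouped.setdefault(maf_bin(maf), []).append(info)

    summary = {}
    for label, infos in grouped.items():
        n = len(infos)
        well_imputed = sum(1 for info in infos if info > INFO_THRESHOLD)
        summary[label] = (n, well_imputed / n, sum(infos) / n)
    return summary


def concordance_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotypes where the imputed call equals the true call.

    Pairs in which either genotype is None (missing) are skipped.
    Returns None when no comparable pair exists.
    """
    pairs = [
        (t, i)
        for t, i in zip(true_genotypes, imputed_genotypes)
        if t is not None and i is not None
    ]
    if not pairs:
        return None
    return sum(1 for t, i in pairs if t == i) / len(pairs)
```

Running `summarise_by_maf_bin` once per combination of reference panel and array would give the kind of per-bin comparison described in the abstract, under the input assumptions stated above.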