PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and more scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format that adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles; this extension is the basis of our planned second release (PLINK 2.0).
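As an illustration of what bit-level parallelism can look like, here is a minimal Python sketch; it is not PLINK's actual C/C++ code. It assumes genotypes are stored as 2-bit codes packed 32 to a 64-bit word, and it tallies all four codes with a few masks and population counts rather than a per-sample loop. The function names (`pack_genotypes`, `count_codes`) and the word layout are hypothetical choices made for this example.

```python
# Illustrative sketch only: names and word layout are assumptions,
# not taken from the PLINK source code.

LOW_BITS = 0x5555555555555555  # low bit of each 2-bit field in a 64-bit word


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_genotypes(codes):
    """Pack up to 32 two-bit codes (0..3) into one integer.

    The first code occupies the two lowest bits.
    """
    if len(codes) > 32:
        raise ValueError("at most 32 two-bit codes fit in a 64-bit word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def count_codes(word: int, n: int):
    """Return how often codes 0, 1, 2, 3 occur among the first n fields.

    Unused fields beyond n must be zero. All fields are examined at
    once through bitwise operations on the whole word.
    """
    low = word & LOW_BITS            # low bit of every field
    high = (word >> 1) & LOW_BITS    # high bit of every field, shifted down
    n3 = popcount(low & high)        # both bits set   -> code 3 (binary 11)
    n1 = popcount(low) - n3          # only low bit    -> code 1 (binary 01)
    n2 = popcount(high) - n3         # only high bit   -> code 2 (binary 10)
    n0 = n - n1 - n2 - n3            # neither bit set -> code 0 (binary 00)
    return n0, n1, n2, n3


codes = [0, 1, 2, 3, 3, 2, 2, 0]
print(count_codes(pack_genotypes(codes), len(codes)))  # prints (2, 1, 3, 2)
```

In a compiled language the same masks would typically be paired with a hardware population-count instruction, so that one word operation handles 32 samples.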
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
The online version of this article (doi:10.1186/s13742-015-0047-8) contains supplementary material, which is available to authorized users.