Single-nucleotide polymorphisms (SNPs) are vital for identifying genetic-level variation in complex diseases. It has been found that the information carried by SNPs on adjacent genes or on the same gene can be represented by a few tag SNPs (called a tag SNP-set or tagSNP-set). In this work, we propose a novel method called TagSNP-set Selection by Optimal Iteration with Linkage Disequilibrium (TSOILD) and develop a quantitative tagSNP-set prediction method called the Physical Distance-Linkage Disequilibrium Prediction Method (PDLDPM). To verify the validity of TSOILD and PDLDPM, we generate a large amount of test data with the simulation software HAPGEN2. According to the experimental results, the prediction accuracy of TSOILD is 6.73%, 3.19%, 6.52% and 1.72% higher than that of Random Sampling, the Genetic Algorithm (GA), the Greedy Algorithm and the TagSNP-Set Selection Method with Maximum Information (TSMI), respectively. In addition, PDLDPM, Linkage Coverage and selection of tag SNPs to maximize prediction accuracy (STAMPA) are used to evaluate the tagSNP-sets selected by Random Sampling, GA, the Greedy Algorithm and TSMI. The results show that PDLDPM performs better than the other two evaluation methods. These methods provide effective assistance for the study of genetic-level variation in complex diseases. © 2020 The Authors. Published by Elsevier B.V.
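To make the idea of a tag SNP representing its neighbours concrete, the following is a minimal sketch of the standard pairwise linkage disequilibrium measure r² and of one plausible reading of a linkage-coverage score. It is not the TSOILD or PDLDPM procedure, whose details are not given in this abstract. The function names `r_squared` and `linkage_coverage`, the 0/1 haplotype encoding, and the r² threshold of 0.8 are assumptions made for illustration.

```python
from typing import List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Pairwise LD (r^2) between two biallelic SNPs.

    `a` and `b` hold the 0/1 allele of each SNP on the same haplotypes.
    """
    n = len(a)
    p_a = sum(a) / n                               # frequency of allele 1 at SNP a
    p_b = sum(b) / n                               # frequency of allele 1 at SNP b
    p_ab = sum(x & y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP has undefined r^2; treated as 0 here
    d = p_ab - p_a * p_b                           # LD coefficient D
    return d * d / denom


def linkage_coverage(haps: List[List[int]], tags: List[int],
                     threshold: float = 0.8) -> float:
    """Fraction of non-tag SNPs with r^2 >= threshold to at least one tag SNP.

    `haps` is a list of haplotypes (rows), each a list of 0/1 alleles (columns).
    The threshold of 0.8 is an assumed, commonly used cut-off.
    """
    n_snps = len(haps[0])
    cols = [[h[j] for h in haps] for j in range(n_snps)]
    others = [j for j in range(n_snps) if j not in tags]
    if not others:
        return 1.0
    covered = sum(
        1 for j in others
        if any(r_squared(cols[j], cols[t]) >= threshold for t in tags)
    )
    return covered / len(others)


# Toy data: SNP 0 and SNP 1 are perfectly correlated, SNP 2 is independent.
haps = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
coverage = linkage_coverage(haps, tags=[0])
```

In this toy data set, choosing SNP 0 as the only tag captures SNP 1 but not SNP 2, so half of the non-tag SNPs are covered.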