Tandem repeats are an important class of DNA repeats, and much research has focused on their efficient identification, their use in DNA typing and fingerprinting, and their causative role in trinucleotide repeat diseases such as Huntington's disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups or families based on sequence similarity so that their biological importance can be explored further. Clustering tandem repeats requires a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. Analyzing and comparing these functions is important because the choice of distance function is at the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions using two cluster validation techniques, Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method that produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: human chromosomes 3, 5, 10, and X, and C. elegans chromosome III.
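The abstract names the five distance functions but does not define them, so the following is only a minimal sketch of what two of them, Euclidean and Jensen-Shannon Divergence, could look like. It assumes each alignment column is summarized as a vector of nucleotide frequencies. That representation and the helper names `column_profile`, `euclidean`, and `jensen_shannon` are illustrative assumptions, not the paper's definitions.

```python
import math

ALPHABET = "ACGT"  # assumption: gaps and ambiguity codes are ignored


def column_profile(column):
    """Return the relative frequencies of A, C, G, T in one alignment column."""
    counts = [column.count(base) for base in ALPHABET]
    total = sum(counts)
    if total == 0:
        raise ValueError("column contains no A, C, G or T characters")
    return [c / total for c in counts]


def euclidean(p, q):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    p = column_profile("AAAC")
    q = column_profile("AACC")
    print(euclidean(p, q), jensen_shannon(p, q))
```

Both functions return 0 for identical columns. Under these assumptions, two columns with no nucleotide in common reach the Jensen-Shannon maximum of 1, while their Euclidean distance depends on how the frequencies are spread across nucleotides.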