CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences

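Basic information

Grant number:
7682840
Principal investigator:
Weizhong Li
Amount:
$270,400
Host institution:
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Host institution country:
United States
Project category:
Fiscal year:
2008
Funding country:
United States
Project period:
2008-09-01 to 2011-06-30
Project status:
Completed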

Project abstract

DESCRIPTION (provided by applicant):

Project Summary/Abstract: CD-HIT is a computer program for clustering and comparing large sets of protein or nucleotide sequences. It significantly reduces the computational and manual effort required in various sequence analysis tasks, and it aids in understanding the structure of the data and correcting bias within a dataset. CD-HIT is 2 to 3 orders of magnitude faster than other methods. It can handle extremely large databases and has been used extensively in various fields. Its popularity is growing, as shown by user feedback and the increasing number of publications that cite it. CD-HIT now has thousands of users and is routinely used in many popular databases, such as UniProt and PDB.

Researchers now face serious challenges from the explosive growth of public sequence databases, a result of high-throughput genome sequencing projects and very recent environmental metagenomic projects. Routine analysis, from searching a database to building a multiple alignment, is becoming more computationally expensive and complicated. An efficient clustering method is crucial to addressing many of these challenges and helping researchers overcome them. Currently, no other available program can replace CD-HIT in terms of speed and the ability to handle very large datasets. CD-HIT will therefore play a more important role in the future.

The goal of this proposal is to further improve and develop the CD-HIT program and related applications, to better serve the growing user community and to address the issues raised by CD-HIT users. The algorithm will be improved to achieve better performance and overcome existing limitations. Effort will be directed towards more accurate clustering results while maintaining the ultrahigh speed. New functions will be implemented to meet various clustering and comparison needs. Enhanced maintenance and better software engineering practices will be adopted to provide regular program releases and updates, better portability, shorter troubleshooting cycles, and richer documentation. Subject to University policies, CD-HIT will continue to be an open source package.

In addition, a web server will be set up for easier public access to CD-HIT's applications. The server will provide further analysis and visualization tools, as well as interfaces and links to other bioinformatics resources. Pre-calculated versions of popular datasets will be made available to the public, eliminating the need for individual labs to repeat the same work.

Project Narrative: CD-HIT is a fast computer program for clustering and comparing biological sequences, used by thousands of researchers in public health related studies. It directly helps researchers significantly reduce the effort required for sequence analysis and correct the bias within public databases. Continued development of CD-HIT will better serve researchers who face growing challenges in sequence analysis caused by the explosive growth of public sequence databases.
