III: Small: Algorithms for Tandem Repeat Variant Discovery Using Next Generation Sequencing Data

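Basic Information

Award number:
1017621
Principal investigator:
Gary Benson
Amount:
$500,000
Host institution:
Trustees of Boston University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2010
Funding country:
United States
Start and end dates:
2010-08-15 to 2015-07-31
Project status:
Completed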

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1017621&HistoricalAwards=false
Keywords:
III Small Algorithms Tandem Repeat

Project Abstract

A tandem repeat (TR) is any pattern of nucleotides that occurs as repeating, consecutive copies along a DNA molecule. Often, the pattern copies are not identical. A TR can be polymorphic, that is, it can differ across individuals in a population: 1) the number of copies may differ, 2) the arrangement of non-identical copies may differ, and 3) the copies may contain different small mutations. TR variants are known to affect important biological processes, such as chromatin structure, gene plasticity, gene expression, and disease states, so their discovery is crucial for correctly understanding complex biomolecular interactions. While a conservative estimate suggests that 100,000 human TRs may be polymorphic, until recently genome-wide study of TR polymorphism, in humans and other organisms, has been too difficult and costly, with the result that the true extent of polymorphism and its effects are unknown.

New genome sequencing technologies offer the first real opportunity to fill in the details of TR diversity. These technologies sequence millions of high-quality, short DNA fragments in a single experiment. Current sequencing projects are producing many billions of reads rich in TR variant information. Yet current read mapping algorithms, which attempt to assign each read to its proper location on the reference genome, are not designed to detect TR variants.

This project has three central goals: 1. Algorithm Development; 2. Genome Studies; 3. Variation Curation in a public database. Strategies will be developed to accurately and efficiently map TR-containing reads to reference genome TR loci. Anticipated algorithmic developments include:
1) Optimization of tree-based alignment, for use when millions of short, disjoint sequences must be aligned to each other. The reads and references can each be merged into separate Patricia tree data structures and alignment computed between tree nodes, thereby eliminating redundant computation in the prefixes of the two sequence sets.
2) Production of space-saving Burrows-Wheeler transforms (BWT) of the most redundant tree parts by employing approximate shortest common superstrings (SCS) for the two sequence sets.
3) Development of an efficient Four-Russians style block computation for edit distance alignment in the trees by exploiting redundancy inherent in the small alphabet and block input scores.
4) Development of a bounding computation for edit distance based on efficient, bit-register computation of longest common subsequence (LCS) alignment.
5) Parallelization of all algorithms for further efficiency with multi-core processors, Single Instruction, Multiple Data (SIMD) bit-register computations, and highly parallel graphics processing units (GPUs).

Data from six recently published whole human genomes, two human centenarian genomes, and the 1000 Genomes Project will be analyzed to discover TR variants. An internet-accessible public database and analysis platform for curation and display of TR variants will be developed.

The TR variant discovery software and all data sets produced will directly enhance the infrastructure for TR diversity research in genome biology, genome evolution, and comparative genomics. The software and data will be freely available to the research community through a high-capacity website maintained in the PI's lab at Boston University. The PI will participate in a variety of activities that link research and education and support participation by members of underrepresented groups, including provision of opportunities in research for graduate and undergraduate students, participation in high school enrichment and curriculum development projects, and editorship of an international journal engaged in the dissemination of bioinformatics research.
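
The fourth anticipated development, bounding edit distance with a bit-register LCS computation, can be illustrated with a short sketch. The abstract does not give the recurrence or the exact bound, so everything here is an assumption made for illustration: the function names are hypothetical, the LCS routine uses a well-known bit-vector update from the bit-parallel LCS literature (with a Python integer standing in for a machine register), and the bounds are the standard relations max(m, n) - LCS <= edit distance <= m + n - 2*LCS for unit-cost edits. A plain dynamic-programming Levenshtein routine is included only as a reference to compare against.

```python
def lcs_length_bitparallel(a: str, b: str) -> int:
    """LCS length of a and b, processing one character of b per bit-vector step."""
    m = len(a)
    if m == 0 or not b:
        return 0
    mask = (1 << m) - 1
    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: no LCS progress recorded yet
    for ch in b:
        u = v & match.get(ch, 0)
        # The carry from the addition propagates through runs of ones;
        # truncating to m bits emulates a fixed-width register.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit left in v corresponds to one unit of LCS length.
    return m - bin(v).count("1")


def edit_distance_bounds(a: str, b: str) -> tuple:
    """(lower, upper) bounds on unit-cost edit distance derived from the LCS."""
    lcs = lcs_length_bitparallel(a, b)
    lower = max(len(a), len(b)) - lcs      # every unmatched position costs >= 1
    upper = len(a) + len(b) - 2 * lcs      # delete and insert all unmatched symbols
    return lower, upper


def levenshtein(a: str, b: str) -> int:
    """Reference O(len(a) * len(b)) edit distance, used only for comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    read = "ACG" * 3        # hypothetical read with three copies of ACG
    reference = "ACG" * 4   # hypothetical reference locus with four copies
    print(edit_distance_bounds(read, reference))
    print(levenshtein(read, reference))
```

For the hypothetical read and reference above, which differ by one copy of the pattern, both bounds equal 3 and so does the exact distance. In a filtering setting, a candidate pairing whose lower bound already exceeds an allowed threshold could be discarded without running the full alignment.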