Improving overlap-finding techniques for whole genome shotgun data
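Basic information
- Award number: 0312360
- Principal investigator:
- Amount: $99,400
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2003
- Funding country: United States
- Project period: 2003-07-15 to 2005-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract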
Yorke
A genome (the DNA in a cell) can be represented by a sequence of letters called "bases." A large genome can consist of billions of bases. Chemical techniques allow scientists to read only a few hundred bases at a time. The whole genome shotgun (WGS) assembly technique creates a draft of the sequence of a whole genome by selecting such short fragments at random from the genome, determining the sequence of the fragments, and then computationally re-assembling millions of these fragments. Two fragments are said to "overlap" if it is plausible that they come from the same part of the genome, based on a comparison of their sequences. The goal of this project is to produce an extremely robust set of overlaps by combining sophisticated error-correction techniques with "localizing" fragments, which validates an overlap by ensuring that both fragments come from the same vicinity of the genome.
Several issues complicate the determination of which pairs of fragments overlap. First, most genomes contain many "repeat regions," i.e., two or more almost identical copies of long stretches of sequence. Thus, two fragments that do not actually overlap may look like they do. Second, the random sampling technique results in many base errors: bases can be misread or missed entirely. These errors, combined with the fact that repeat regions usually differ slightly, make it very difficult to distinguish a spurious overlap from a true overlap in which one or both fragments contain read errors. Thus, if extreme care is not taken, it is easy to use a spurious overlap and thereby mistakenly connect distant parts of the genome. Preliminary results in collaboration with Celera Genomics, the Baylor College of Medicine, and The Institute for Genomic Research (TIGR) have demonstrated that the investigator's current techniques can already produce more sequence at higher quality. The goal is to improve these techniques and make them widely available.
The determination and interpretation of genetic information is one of the great challenges of the twenty-first century. The genome, i.e., all the DNA in a cell, is the molecular basis of diversity and the cornerstone of genetic information. Draft genomes have been obtained for human, mouse, and some insects, fish, plants, and bacteria. This is a start, but a full understanding of biological processes cannot be had by studying the genomes of only a handful of species. The federal government is spending about 100 million dollars per year generating sequence data. In the first stage, millions of small pieces are sampled from the genome. The second stage is called "assembly," when these pieces are re-assembled on a computer like a giant jigsaw puzzle. The puzzle is complicated by two facts: first, many of the puzzle pieces have small errors that make them misfit against pieces that they SHOULD fit with; and second, many pieces that should NOT go together actually fit together quite well. This makes it extremely difficult to correctly assemble a genome. There are two ways to decrease the ambiguities. First, one could generate more pieces. However, each new piece costs about $2, and one would need to generate millions of new pieces to have a significant effect on assembly quality. The investigators use a second route: they attempt to squeeze as much information out of the existing pieces as possible. This route is substantially cheaper, and there is still much room for improvement over existing techniques. The investigators are using sophisticated mathematics to help discern with extreme precision those pairs of pieces that do, and those that do not, fit together. Preliminary results of the investigators, in collaboration with several large sequencing centers, have demonstrated that using their techniques to "pre-process" the pieces can produce more of the genome, with fewer errors. This project aims at extending these ideas further and making them freely accessible to all investigators. The impact on the federal genome (biotechnology) projects is potentially great.
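To make the notion of an "overlap" concrete, here is a minimal Python sketch of a naive suffix-prefix overlap test between two fragments. It is not the investigators' method: the function name `best_overlap`, its parameters, and the toy reads are all assumptions made for illustration. It tolerates only substitution errors (misread bases), not missed bases, and applies none of the error-correction or localization ideas described in the abstract.

```python
from typing import Optional, Tuple


def best_overlap(a: str, b: str, min_len: int = 10,
                 max_error_rate: float = 0.05) -> Optional[Tuple[int, int]]:
    """Find the longest suffix of `a` that matches a prefix of `b`.

    Returns (overlap_length, mismatches), or None when no overlap of at
    least `min_len` bases stays within the mismatch budget.
    Hypothetical helper: only substitutions are tolerated, no indels.
    """
    longest = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink from there.
    for length in range(longest, min_len - 1, -1):
        suffix = a[len(a) - length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_error_rate * length:
            return length, mismatches
    return None


# Toy data (made up for this sketch): two reads sampled from one short "genome".
GENOME = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGT"
READ_A = GENOME[0:24]    # covers positions 0..23
READ_B = GENOME[12:36]   # covers positions 12..35, so the true overlap is 12 bases
# The same second read with one simulated read error (C misread as T).
READ_B_ERR = READ_B[:5] + "T" + READ_B[6:]

if __name__ == "__main__":
    print(best_overlap(READ_A, READ_B, max_error_rate=0.0))
    print(best_overlap(READ_A, READ_B_ERR, max_error_rate=0.1))
    print(best_overlap(READ_A, READ_B_ERR, max_error_rate=0.0))
```

With exact matching, the clean pair yields a 12-base overlap with no mismatches. The read carrying a simulated error is accepted only once the mismatch budget is loosened, and that same loosening is what lets near-identical repeat regions pass as overlaps, which is the ambiguity the abstract describes.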
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by James Yorke
Mathematical Modeling of DNA Repeats and HIV Epidemics
- Award number: 0616585
- Fiscal year: 2006
- Funding amount: $99,400
- Project type: Continuing Grant
Chaos with Multiple Positive Lyapunov Exponents
- Award number: 9870183
- Fiscal year: 1998
- Funding amount: $99,400
- Project type: Continuing Grant
Mathematical Sciences: Chaos with Multiple Positive Lyapunov Exponents
- Award number: 9423843
- Fiscal year: 1995
- Funding amount: $99,400
- Project type: Continuing Grant
Attractor Reconstruction from Experimental Data
- Award number: 9116391
- Fiscal year: 1992
- Funding amount: $99,400
- Project type: Continuing Grant
Mathematical Sciences: Bifurcation and Global Continuation
- Award number: 8117967
- Fiscal year: 1982
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7818221
- Fiscal year: 1979
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7624432
- Fiscal year: 1976
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7424310
- Fiscal year: 1974
- Funding amount: $99,400
- Project type: Continuing Grant
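Similar NSFC (National Natural Science Foundation of China) grants
Research on overlapping community detection methods for dynamic heterogeneous learner relationship networks in social learning environments
- Award number: 62077045
- Approval year: 2020
- Funding amount: CNY 480,000
- Project type: General Program
Research on personalized location recommendation methods integrating contextual information and overlapping community detection
- Award number: 61806083
- Approval year: 2018
- Funding amount: CNY 250,000
- Project type: Young Scientists Fund
Information fusion across heterogeneous multiple social networks and multidimensional evolution of hot events
- Award number: 61772133
- Approval year: 2017
- Funding amount: CNY 650,000
- Project type: General Program
Research on overlapping community detection methods for microblog users based on graph aggregation techniques
- Award number: 61762078
- Approval year: 2017
- Funding amount: CNY 390,000
- Project type: Fund for Less Developed Regions
Research on overlapping community detection based on active heterogeneous supervision and its model selection methods
- Award number: 61503281
- Approval year: 2015
- Funding amount: CNY 200,000
- Project type: Young Scientists Fund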
Similar overseas grants
Exploring the overlap between neurodevelopmental disorders and traits with adolescent hypomania
- Award number: 2886920
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Studentship
The cardiovascular consequences of sleep apnea plus COPD (Overlap syndrome)
- Award number: 10733384
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
Domestic Abuse Proceedings In Family Courts: Overlap And Pathways In Private And Public Family Justice
- Award number: ES/X011399/1
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Fellowship
Integrating Epidemiologic and Genomic Data to Elucidate the Genetic Overlap Between Congenital Anomalies and Pediatric Cancer
- Award number: 10749761
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
The Changing Structure of the International Court of Justice: Overlap of Dispute Settlement and International Control
- Award number: 23K01112
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Grant-in-Aid for Scientific Research (C)