The Terabase Search Engine
Basic information
- Grant number: 8882493
- Principal investigator:
- Amount: $346,100
- Host institution:
- Host institution country: United States
- Project type:
- Fiscal year: 2014
- Funding country: United States
- Project period: 2014-07-01 to 2017-04-30
- Project status: Completed
- Source:
- Keywords: Acceleration; Affect; Algorithms; Archives; Chromosome Structures; Code; Communities; Complex; Computational algorithm; Computer software; Computers; Coupled; DNA Sequence; DNA Sequence Databases; Data; Data Compression; Data Set; Databases; Deposition; Disease; Galaxy; Genes; Genome; Goals; Health; Human; Human Genetics; Human Genome; Infectious Diseases Research; Investigation; Modeling; Molecular; Mutation; Positioning Attribute; Process; Reaction Time; Reading; Real-Time Systems; Research; Research Personnel; Resources; Retrieval; Running; Scientist; Sequence Analysis; Services; Site; Solutions; Sorting - Cell Movement; Speed; System; Time; Validation; Variant; Writing; base; cloud based; design; human DNA sequencing; human disease; human genome sequencing; indexing; instrument; microbial; next generation sequencing; novel strategies; open source; programs; tool; trait; user-friendly; web interface
Project abstract
DESCRIPTION (provided by applicant): We propose to create a new system, the Terabase Search Engine (TSE), that will make it possible for biomedical researchers to search all human DNA sequences that have been sequenced and deposited in public archives. The vast and growing resource of human DNA sequences provides a wealth of opportunities for scientific discovery and for validation of results, but the size of the data sets has already far exceeded the ability of most researchers to use them. For more than two decades, geneticists have relied on DNA sequence databases for a wide range of scientific endeavors, including the discovery of new genes and new mutations, the investigation of evolutionary changes within and between species, the forces affecting chromosomal structure and change, and many other molecular and evolutionary processes. The ability to search all known genes and genomes using BLAST and similar programs has long been assumed, and sequence search engines throughout the world provide this ability. However, the raw data pouring out of next-generation sequencing (NGS) projects has exceeded our ability to provide rapid access to it. A single NGS instrument can generate six billion reads encompassing 600 billion bases in a single run, and this capacity is still growing. Traditional alignment programs like BLAST cannot sort through this data in a reasonable amount of time. Newer programs such as Bowtie (developed by our group) allow far faster alignment of NGS reads to the genome, but the data sets, now in excess of 1 trillion reads, far exceed the ability of most computers to store them, and even the fastest alignment programs today could not search all this data in a reasonable amount of time. A new approach is required in order to serve up these huge and hugely valuable DNA sequences to the research community. The Terabase Search Engine will be a new, highly efficient system for searching trillions of bases in real time. Using a hierarchical search strategy with extensive pre-processing to speed up response time, the TSE will allow a scientist to align any sequence, human or non-human, to all publicly available human sequence reads. Reads that match the human genome will be indexed and stored on very high-speed disks for rapid retrieval. Reads that match microbial sequences will be captured and stored separately for use in microbiome and infectious disease research. The system will be made available through a user-friendly web interface, and a local database will store each user's results for further analysis on the TSE site or for download to a local site. This system will make it possible, for the first time ever, for any scientist to align a sequence to the complete set of human DNA sequences and to retrieve everything that matches, without the need to write special-purpose programs or to use complex cloud-based software interfaces. All of the software for this project will be developed under an open-source model that will permit others to use, modify, share, and redistribute the code without restriction.
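The abstract does not say how the hierarchical search with pre-processing would be implemented. As a rough illustration only, the general idea of "index first, verify second" can be sketched with a toy k-mer filter. Everything here is an assumption made for the example: the names `build_index` and `search`, the k-mer length `K`, the `min_shared` threshold, and the exact-substring check standing in for real alignment. None of it is taken from the TSE or from Bowtie.

```python
from collections import defaultdict

K = 4  # hypothetical k-mer length, kept small for the toy data


def kmers(seq, k=K):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(reads, k=K):
    """Pre-processing: map each k-mer to the ids of the reads containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def search(query, reads, index, k=K, min_shared=2):
    """Two-level search: cheap k-mer filter, then exact verification."""
    # Level 1: count distinct query k-mers shared with each read,
    # touching only the index and never the reads themselves.
    shared = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for read_id in index.get(kmer, ()):
            shared[read_id] += 1
    candidates = [rid for rid, n in shared.items() if n >= min_shared]
    # Level 2: verify the few surviving candidates against the stored
    # reads (exact substring match stands in for a real aligner).
    return sorted(rid for rid in candidates if query in reads[rid])


reads = ["ACGTACGTGA", "TTGACCAGTA", "GGGACGTGAC"]
index = build_index(reads)
print(search("ACGTGA", reads, index))  # reads 0 and 2 contain the query
print(search("CAGTAC", reads, index))  # passes the filter on read 1, fails verification
```

The index is built once, so each query pays only for k-mer lookups plus verification of a short candidate list instead of a scan over every read.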
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Steven L. Salzberg
Quality assessment of splice site annotation based on conservation across multiple species
- DOI:
- Publication date:
- Journal:
- Impact factor: 0
- Authors: Ilia Minkin; Steven L. Salzberg
- Corresponding author: Steven L. Salzberg
Other grants by Steven L. Salzberg
Comprehensive Human Expressed Sequences in Brain (CHESS-BRAIN) and their roles in neuropsychiatric illness
- Grant number: 10541887
- Fiscal year: 2021
- Funding amount: $346,100
- Project type:
Comprehensive Human Expressed Sequences in Brain (CHESS-BRAIN) and their roles in neuropsychiatric illness
- Grant number: 10362615
- Fiscal year: 2021
- Funding amount: $346,100
- Project type:
Comprehensive Human Expressed Sequences in Brain (CHESS-BRAIN) and their roles in neuropsychiatric illness
- Grant number: 10205617
- Fiscal year: 2021
- Funding amount: $346,100
- Project type:
Computational Methods for Microbial and Microbiome Sequence Analysis
- Grant number: 10331733
- Fiscal year: 2019
- Funding amount: $346,100
- Project type:
Computational Methods for Microbial and Microbiome Sequence Analysis
- Grant number: 10550160
- Fiscal year: 2019
- Funding amount: $346,100
- Project type:
Computational Methods for Microbial and Microbiome Sequence Analysis
- Grant number: 10083744
- Fiscal year: 2019
- Funding amount: $346,100
- Project type:
Computational Gene Modeling and Genome Sequence Assembly
- Grant number: 8329127
- Fiscal year: 2011
- Funding amount: $346,100
- Project type:
Alignment Software for Second-Generation Sequencing
- Grant number: 8068060
- Fiscal year: 2011
- Funding amount: $346,100
- Project type:
Alignment Software for Second-Generation Sequencing
- Grant number: 8464182
- Fiscal year: 2011
- Funding amount: $346,100
- Project type:
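Similar grants from the National Natural Science Foundation of China (NSFC)
Evaluation methods, influence mechanisms, and optimization strategies for the microclimate of traditional Jiangnan villages, based on advanced algorithms and behavioral analysis
- Grant number: 52378011
- Approval year: 2023
- Funding amount: CNY 500,000
- Project type: General Program
Key influencing factors and efficient algorithms for opinion dynamics on social networks
- Grant number: 62372112
- Approval year: 2023
- Funding amount: CNY 500,000
- Project type: General Program
Employee algorithm-avoidance behavior: conceptual structure, scale development, and multilevel influence mechanisms, from a perspective integrating big-data and small-data research methods
- Grant number: 72372021
- Approval year: 2023
- Funding amount: CNY 400,000
- Project type: General Program
The effect of algorithmic human resource management on employees' algorithm-coping behavior and job performance: a study of pathways through employee cognition and emotion
- Grant number: 72372070
- Approval year: 2023
- Funding amount: CNY 400,000
- Project type: General Program
Influencing factors and mechanisms of the algorithmic divide
- Grant number: 72304017
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund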
Similar overseas grants
Bioethical, Legal, and Anthropological Study of Technologies (BLAST)
- Grant number: 10831226
- Fiscal year: 2023
- Funding amount: $346,100
- Project type:
Exploratory Analysis Tools for Developmental Studies of Brain Microstructure with Diffusion MRI
- Grant number: 10645844
- Fiscal year: 2023
- Funding amount: $346,100
- Project type:
GPU-based SPECT Reconstruction Using Reverse Monte Carlo Simulations
- Grant number: 10740079
- Fiscal year: 2023
- Funding amount: $346,100
- Project type:
Mechanisms by which PIM kinase modulates the effector function of autoreactive CD8 T cells in type 1 diabetes
- Grant number: 10605431
- Fiscal year: 2023
- Funding amount: $346,100
- Project type:
Extending Reach, Accuracy, and Therapeutic Capabilities: A Soft Robot for Peripheral Early-Stage Lung Cancer
- Grant number: 10637462
- Fiscal year: 2023
- Funding amount: $346,100
- Project type: