Scaling up computational genomics with tree sequences
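Basic information
- Grant number: 10471496
- Principal investigator:
- Amount: $556,800
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-09-24 to 2023-08-31
- Project status: Completed
- Source: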
- Keywords: Address; Affect; Algorithmic Software; Algorithms; Architecture; Area; Base Sequence; Collection; Communities; Complex; Computer software; Computing Methodologies; Data; Data Compression; Data Set; Development; Disease; Ecology; Ensure; Epidemiology; Etiology; Evolution; Genealogical Tree; Genealogy; Generations; Genetic; Genetic Processes; Genetic Recombination; Genetic Variation; Genome; Genomics; Genotype; Goals; Haplotypes; Health; Health Benefit; Human; Human Genetics; Human Genome; Individual; Internet; Libraries; Maps; Methods; Modeling; Modernization; Mutation; Performance; Phase; Phenotype; Population; Population Genetics; Population Sizes; Positioning Attribute; Process; Production; Recording of previous events; Records; Research; Running; Sample Size; Sampling; Statistical Data Interpretation; Structure; Testing; Time; Training; Trees; Tsunami; Validation; Variant; Work; algorithm development; base; computer framework; cost; data format; data reuse; data structure; deep learning; design; experience; frontier; genome-wide; genomic data; human disease; improved; interoperability; learning strategy; member; multicore processor; next generation; novel; novel strategies; open source; operation; scale up; sequence learning; simulation; statistics; structural genomics; success; supervised learning; whole genome
Project Summary/Abstract
Increasing sample size is a tremendously important factor in building our understanding of the genetics of
human disease. As we discover that more and more diseases have a complex web of genetic causation, we
need larger and larger genetic datasets to disentangle them, and to ultimately produce successful therapies.
Driven in part by this need, the community is now assembling vast collections of human genome sequences,
and millions of samples will soon be commonplace. Nonhuman datasets, with applications in epidemiology,
ecology, and evolution, will not be far behind. There is a profound problem, however: our computational
methods for storing, processing, simulating, and analyzing genomic data are lagging far behind our ability to
collect such data. The algorithms and data structures underlying today's computational methods were designed
for thousands of samples, not millions, and we are in danger of being overwhelmed by the impending tsunami
of data. Without a fundamental change in how we store and process genomic data, we will either not fully tap
the potential of the data we collect, or the computational costs will be astronomical – or both.
Our proposal addresses this critical need by focusing on a new data structure: the succinct tree sequence.
This data structure (the “tree sequence”, for brevity) encodes genetic variation data using the population
genetics processes that produced the data itself – by representing variation among contemporary samples via
mutations on the branches of the underlying genealogical trees. This yields extraordinary levels of data
compression, with file sizes hundreds of times smaller than current community standards. Since the tree sequence
was introduced in 2016, it has led to performance increases of 2–4 orders of magnitude in the diverse
applications of genome simulation, calculation of statistics, and ancestry inference. Such sudden leaps in
computational performance are vanishingly rare, and only possible through deep algorithmic advances.
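
As a rough illustration of the encoding idea described above, the toy Python sketch below stores genealogical trees as edges that each apply to a genomic interval, places mutations on branches, and decodes genotypes by walking from each sample towards the root. It is only a sketch under simplifying assumptions: the names `Edge`, `Mutation`, `parent_map`, `carries`, and `genotypes` are hypothetical, every mutation is treated as a single 0-to-1 change, and nothing here reflects the file format or API of any real tree sequence library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    left: float    # start of the genomic interval (inclusive)
    right: float   # end of the genomic interval (exclusive)
    parent: int    # ancestor node ID
    child: int     # descendant node ID


@dataclass(frozen=True)
class Mutation:
    position: float
    node: int      # the mutation sits on the branch above this node


def parent_map(edges, position):
    """Return {child: parent} for the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def carries(sample, node, parents):
    """True if `node` is `sample` itself or one of its ancestors."""
    current = sample
    while current is not None:
        if current == node:
            return True
        current = parents.get(current)  # None once we step past the root
    return False


def genotypes(samples, edges, mutations):
    """Decode one row of 0/1 genotypes per mutation (1 = derived allele)."""
    rows = []
    for m in sorted(mutations, key=lambda mut: mut.position):
        parents = parent_map(edges, m.position)
        rows.append([int(carries(s, m.node, parents)) for s in samples])
    return rows


# Toy data (invented for illustration): three samples (0, 1, 2) on a genome
# [0, 10) with one recombination breakpoint at position 5.
# Left tree:  4 -> (3 -> (0, 1), 2).  Right tree: 4 -> (3 -> (1, 2), 0).
# Edges spanning [0, 10) are stored once and shared by both trees.
edges = [
    Edge(0, 10, parent=3, child=1),
    Edge(0, 5, parent=3, child=0),
    Edge(5, 10, parent=3, child=2),
    Edge(0, 10, parent=4, child=3),
    Edge(0, 5, parent=4, child=2),
    Edge(5, 10, parent=4, child=0),
]
mutations = [Mutation(2, node=3), Mutation(7, node=3), Mutation(8, node=0)]

matrix = genotypes([0, 1, 2], edges, mutations)
```

In this sketch, a full row of genotypes is recovered from one integer per mutation plus the shared edges, which hints at where the compression comes from: neighbouring trees along the genome share most of their branches, so each shared branch is recorded only once.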
Our research plan builds on the extraordinary successes of tree sequence methods so far, scaling up three
crucial layers of computational genomics: analysis, simulation, and inference. First, we will continue our
development of highly efficient tree-sequence-based methods for fundamental operations in statistical and
population genetics. Second, we will scale up genome simulations by integrating tree sequence methods into
complex forward-time simulations, utilizing modern, multicore processors. Third, we will combine efficient
genome simulations with cutting-edge deep-learning methods to improve existing inference methods, both
of tree sequences from genomic data, and of population parameters from novel tree-sequence encodings of
genotype data. Together, we aim to revolutionize the way we work with population genetic variation data, and
how we use it to understand human health and evolutionary processes.
Our experienced, interdisciplinary team is committed to producing rigorously tested and validated software
and accessible, interoperable, and reusable data formats through inclusive and open development.
Project outcomes
Journal articles (7)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A general and efficient representation of ancestral recombination graphs.
- DOI:
- Publication date: 2024-04-23
- Journal:
- Impact factor: 0
- Authors: Wong, Yan; Ignatieva, Anastasia; Koskela, Jere; Gorjanc, Gregor; Wohns, Anthony Wilder; Kelleher, Jerome
- Corresponding author: Kelleher, Jerome
Looking forwards and backwards: dynamics and genealogies of locally regulated populations.
- DOI:
- Publication date: 2023-12-30
- Journal:
- Impact factor: 0
- Authors: Etheridge, Alison M; Letter, Ian; Kurtz, Thomas G; Ralph, Peter L; Ho Lung, Terence Tsui
- Corresponding author: Ho Lung, Terence Tsui
Genetic architecture, spatial heterogeneity, and the coevolutionary arms race between newts and snakes.
- DOI:
- Publication date: 2024-03-01
- Journal:
- Impact factor: 0
- Authors: Caudill, Victoria; Ralph, Peter L
- Corresponding author: Ralph, Peter L
link-ancestors: fast simulation of local ancestry with tree sequence software.
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Tsambos, Georgia; Kelleher, Jerome; Ralph, Peter; Leslie, Stephen; Vukcevic, Damjan
- Corresponding author: Vukcevic, Damjan
SLiM 4: Multispecies Eco-Evolutionary Modeling.
- DOI:
- Publication date: 2023-05
- Journal:
- Impact factor: 0
- Authors: Haller, Benjamin C; Messer, Philipp W
- Corresponding author: Messer, Philipp W
Other grants by PETER Lochhead RALPH
Scaling up computational genomics with tree sequences
- Grant number: 10585745
- Fiscal year: 2023
- Funding amount: $556,800
- Project category:
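Similar NSFC (National Natural Science Foundation of China) grants
Effects and mechanisms of the Tripterygium wilfordii derivative LLDT-8 on the function of rheumatoid arthritis synovial fibroblasts, studied via the lncRNA NONHSAT042241/hnRNP D/β-catenin axis
- Grant number: 82304988
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Effects of acupuncture manipulation techniques and parameters on the initiation of acupuncture effects, and the underlying mechanisms
- Grant number: 82305416
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Study of how Erxian decoction affects adrenal cortex-medulla hormone secretion and regulates hypothalamic thermoreceptors to relieve the hot flashes of "tiangui exhaustion"
- Grant number: 82374307
- Approval year: 2023
- Funding amount: CNY 480,000
- Project category: General Program
Water-exit stability of fixed-wing sea-air cross-domain vehicles and the mechanisms by which hydrodynamic loads affect it
- Grant number: 52371327
- Approval year: 2023
- Funding amount: CNY 500,000
- Project category: General Program
Impact of economic sanctions on the construction of multinational enterprises' overseas R&D networks: the perspective of sanctioned firms
- Grant number: 72302155
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund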
Similar overseas grants
Bioethical Issues Associated with Objective Behavioral Measurement of Children with Hearing Loss in Naturalistic Environments
- Grant number: 10790269
- Fiscal year: 2023
- Funding amount: $556,800
- Project category:
SCH: AI-Enhanced Multimodal Sensor-on-a-chip for Alzheimer's Disease Detection
- Grant number: 10685378
- Fiscal year: 2022
- Funding amount: $556,800
- Project category:
Functional Connectivity and Baseline Networks of the White Matter Brain: Development and Dissemination of Algorithms and Tools
- Grant number: 10391136
- Fiscal year: 2022
- Funding amount: $556,800
- Project category:
SpeechSense: An Interactive Sensor Platform for Speech Therapy
- Grant number: 10256832
- Fiscal year: 2022
- Funding amount: $556,800
- Project category:
Administrative Supplement - Rapid Actionable Data for Opioid Response in Kentucky (RADOR-KY)
- Grant number: 10850016
- Fiscal year: 2022
- Funding amount: $556,800
- Project category: