CAREER: Robust and scalable genome-wide phylogenetics
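Basic information
- Award number: 1845967
- Principal investigator:
- Amount: $549,200
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2019
- Funding country: United States
- Period: 2019-02-15 to 2024-01-31
- Status: Completed
- Source:
- Keywords:
Project abstract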
The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating, but more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken), but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem, and as it turns out, a difficult one. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree of life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to model adequately and hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to the limits of their scalability. Thus, an improved understanding of the tree of life requires not just more data but also better algorithms. Interestingly, as data science permeates many areas of science, the issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.
This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, which has two sources: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from the complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science among undergraduate and K-12 students, emphasizing for them both the excitement and the challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outputs
Journal articles (22)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
TreeCluster: Clustering biological sequences using phylogenetic trees
- DOI: 10.1371/journal.pone.0221068
- Published: 2019-08-22
- Journal:
- Impact factor: 3.7
- Authors: Balaban, Metin; Moshiri, Niema; Mirarab, Siavash
- Corresponding author: Mirarab, Siavash
SODA: multi-locus species delimitation using quartet frequencies
- DOI: 10.1093/bioinformatics/btaa1010
- Published: 2020
- Journal:
- Impact factor: 5.8
- Authors: Rabiee, Maryam; Mirarab, Siavash
- Corresponding author: Mirarab, Siavash
Completing gene trees without species trees in sub-quadratic time
- DOI: 10.1093/bioinformatics/btab875
- Published: 2022-01-03
- Journal:
- Impact factor: 5.8
- Authors: Mai, Uyen; Mirarab, Siavash
- Corresponding author: Mirarab, Siavash
Multispecies Coalescent: Theory and Applications in Phylogenetics
- DOI: 10.1146/annurev-ecolsys-012121-095340
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Mirarab, Siavash; Nakhleh, Luay; Warnow, Tandy
- Corresponding author: Warnow, Tandy
Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea
- DOI: 10.1038/s41467-019-13443-4
- Published: 2019-12-02
- Journal:
- Impact factor: 16.6
- Authors: Zhu, Qiyun; Mai, Uyen; Knight, Rob
- Corresponding author: Knight, Rob
Other publications by Siavash Mir arabbaygi
Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction
- DOI:
- Published: 2015-08
- Journal:
- Impact factor: 0
- Authors: Siavash Mir arabbaygi
- Corresponding author: Siavash Mir arabbaygi
A Bayesian Framework for Software Regression Testing
- DOI:
- Published: 2008-08
- Journal:
- Impact factor: 0
- Authors: Siavash Mir arabbaygi
- Corresponding author: Siavash Mir arabbaygi
Other grants held by Siavash Mir arabbaygi
III: Small: New algorithms for genome skimming and its applications
- Award number: 1815485
- Fiscal year: 2018
- Amount: $549,200
- Award type: Standard Grant
CRII: III: Using Genomic Context to Understand Evolutionary Histories of Individual Genes
- Award number: 1565862
- Fiscal year: 2016
- Amount: $549,200
- Award type: Standard Grant
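Similar NSFC (National Natural Science Foundation of China) grants
Molecular mechanism of the algal-growth-promoting effect produced when symbiotic bacteria of Amphidinium carterae degrade phosphonates
- Award number: 42306167
- Approval year: 2023
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Analysis and design of efficient and robust message authentication codes
- Award number: 61202422
- Approval year: 2012
- Amount: CNY 230,000
- Award type: Young Scientists Fund
Semidefinite relaxation and nonconvex quadratically constrained quadratic programming
- Award number: 11271243
- Approval year: 2012
- Amount: CNY 600,000
- Award type: General Program
New methods for covert active underwater detection based on composite coded pulse trains
- Award number: 61271414
- Approval year: 2012
- Amount: CNY 600,000
- Award type: General Program
Selected problems in network revenue management for civil aviation passenger transport
- Award number: 60776817
- Approval year: 2007
- Amount: CNY 200,000
- Award type: Joint Fund Program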
Similar overseas grants
CAREER: Scalable and Robust Uncertainty Quantification using Subsampling Markov Chain Monte Carlo Algorithms
- Award number: 2340586
- Fiscal year: 2024
- Amount: $549,200
- Award type: Continuing Grant
CAREER: Towards Scalable and Robust Inference of Phylogenetic Networks
- Award number: 2144367
- Fiscal year: 2022
- Amount: $549,200
- Award type: Continuing Grant
CAREER: Scalable and Robust Dynamic Matching Market Design
- Award number: 1846237
- Fiscal year: 2019
- Amount: $549,200
- Award type: Continuing Grant
CAREER: Leveraging Combinatorial Structures for Robust and Scalable Learning
- Award number: 1845032
- Fiscal year: 2019
- Amount: $549,200
- Award type: Continuing Grant
CAREER: Robust, scalable, reliable machine learning
- Award number: 1750286
- Fiscal year: 2018
- Amount: $549,200
- Award type: Continuing Grant