CAREER: Robust and scalable genome-wide phylogenetics

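Basic information

  • Award number:
    1845967
  • Principal investigator:
  • Amount:
    $549,200
  • Institution:
  • Institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-02-15 to 2024-01-31
  • Project status:
    Completed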

Project abstract

The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating and, more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken), but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem and, as it turns out, a difficult one. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree of life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to model adequately and hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to the limits of their scalability. Thus, an improved understanding of the tree of life requires not just more data but also better algorithms. Interestingly, as data science permeates many areas of science, the issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.
This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, which has two sources: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from the complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science among undergraduate and K-12 students, emphasizing for them both the excitement and the challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well documented. Yearly workshops will be held to help biologists learn and use the tools.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (22)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
TreeCluster: Clustering biological sequences using phylogenetic trees
  • DOI:
    10.1371/journal.pone.0221068
  • Published:
    2019-08
  • Journal:
  • Impact factor:
    3.7
  • Authors:
    Balaban, Metin;Moshiri, Niema;Mai, Uyen;Jia, Xingfan;Mirarab, Siavash;Bozdag, Serdar
  • Corresponding author:
    Bozdag, Serdar
Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea
  • DOI:
    10.1038/s41467-019-13443-4
  • Published:
    2019-12-01
  • Journal:
  • Impact factor:
    16.6
  • Authors:
    Qiyun Zhu;U. Mai;W. Pfeiffer;Stefan Janssen;F. Asnicar;J. Sanders;P. Belda;Gabriel A. Al;Evguenia Kopylova;Daniel McDonald;T. Kosciólek;John B. Yin;Shi Huang;Nimaichand Salam;Jian‐Yu Jiao;Zijun Wu;Z. Xu;Kalen Cantrell;Yimeng Yang;Erfan Sayyari;M. Rabiee;James T. Morton;S. Podell;D. Knights;Wenjun Li;C. Huttenhower;N. Segata;L. Smarr;S. Mirarab;R. Knight
  • Corresponding author:
    R. Knight
TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution
  • DOI:
    10.1111/2041-210x.13696
  • Published:
    2021-08
  • Journal:
  • Impact factor:
    6.6
  • Authors:
    Zhang, Chao;Zhao, Yiming;Braun, Edward L.;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash
Multispecies Coalescent: Theory and Applications in Phylogenetics
  • DOI:
    10.1146/annurev-ecolsys-012121-095340
  • Published:
    2021-11
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mirarab, Siavash;Nakhleh, Luay;Warnow, Tandy
  • Corresponding author:
    Warnow, Tandy
Phylogenetic double placement of mixed samples
  • DOI:
    10.1093/bioinformatics/btaa489
  • Published:
    2020-07
  • Journal:
  • Impact factor:
    5.8
  • Authors:
    Balaban, Metin;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash

Other publications by Siavash Mir arabbaygi

Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction
  • DOI:
  • Published:
    2015-08-01
  • Journal:
  • Impact factor:
    0
  • Authors:
    Siavash Mir arabbaygi
  • Corresponding author:
    Siavash Mir arabbaygi
A Bayesian Framework for Software Regression Testing
  • DOI:
  • Published:
    2008-08-29
  • Journal:
  • Impact factor:
    0
  • Authors:
    Siavash Mir arabbaygi
  • Corresponding author:
    Siavash Mir arabbaygi

Other grants by Siavash Mir arabbaygi

III: Small: New algorithms for genome skimming and its applications
  • Award number:
    1815485
  • Fiscal year:
    2018
  • Funding amount:
    $549,200
  • Project type:
    Standard Grant
CRII: III: Using Genomic Context to Understand Evolutionary Histories of Individual Genes
  • Award number:
    1565862
  • Fiscal year:
    2016
  • Funding amount:
    $549,200
  • Project type:
    Standard Grant

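Similar grants from the National Natural Science Foundation of China (NSFC)

Molecular mechanism by which symbiotic bacteria of Amphidinium carterae degrade phosphonates to produce an algal growth-promoting effect
  • Award number:
    42306167
  • Approval year:
    2023
  • Funding amount:
    300,000 CNY
  • Project type:
    Young Scientists Fund
Research on a new method for covert active underwater detection based on composite coded pulse trains
  • Award number:
    61271414
  • Approval year:
    2012
  • Funding amount:
    600,000 CNY
  • Project type:
    General Program
Research on semidefinite relaxation and nonconvex quadratically constrained quadratic programming
  • Award number:
    11271243
  • Approval year:
    2012
  • Funding amount:
    600,000 CNY
  • Project type:
    General Program
Analysis and design of efficient and robust message authentication codes
  • Award number:
    61202422
  • Approval year:
    2012
  • Funding amount:
    230,000 CNY
  • Project type:
    Young Scientists Fund
Research on several problems in network revenue management for civil aviation passenger transport
  • Award number:
    60776817
  • Approval year:
    2007
  • Funding amount:
    200,000 CNY
  • Project type:
    Joint Fund Program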

Similar overseas grants

CAREER: Scalable and Robust Uncertainty Quantification using Subsampling Markov Chain Monte Carlo Algorithms
  • Award number:
    2340586
  • Fiscal year:
    2024
  • Funding amount:
    $549,200
  • Project type:
    Continuing Grant
CAREER: Towards Scalable and Robust Inference of Phylogenetic Networks
  • Award number:
    2144367
  • Fiscal year:
    2022
  • Funding amount:
    $549,200
  • Project type:
    Continuing Grant
CAREER: Leveraging Combinatorial Structures for Robust and Scalable Learning
  • Award number:
    1845032
  • Fiscal year:
    2019
  • Funding amount:
    $549,200
  • Project type:
    Continuing Grant
CAREER: Scalable and Robust Dynamic Matching Market Design
  • Award number:
    1846237
  • Fiscal year:
    2019
  • Funding amount:
    $549,200
  • Project type:
    Continuing Grant
CAREER: Robust, scalable, reliable machine learning
  • Award number:
    1750286
  • Fiscal year:
    2018
  • Funding amount:
    $549,200
  • Project type:
    Continuing Grant