IIBR Informatics: An Efficient Pangenomics Graph Aligner

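Basic information

  • Award number:
    2029552
  • Principal investigator:
  • Amount:
    $700,400
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Period:
    2020-09-01 to 2024-08-31
  • Project status:
    Completed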

Project abstract

In the past decade, there has been an effort to sequence and compare the DNA of a large number of individuals of a given species, resulting in not just a single reference genome but a population of genomes for that species. Enormous public datasets are now available, including the 1000 Genomes Project, the 100,000 Genomes Project, the 1001 Arabidopsis Genomes project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Key software tools, called short read aligners, align newly sequenced DNA fragments to one or more reference genomes in order to identify genetic variation within the species. Downstream analysis of this genetic variation finds causal relationships between complex diseases and phenotypes. Existing short read aligners are unable to align to a large number of reference genomes, purely because of computational constraints. Hence, they align to only a small number of genomes to keep memory and time requirements manageable. Unfortunately, although individuals of the same species share a large percentage of their genetic material, the differences are also important, and aligning to only a small number of genomes can leave some DNA fragments unaligned or poorly aligned. This, in turn, makes it more challenging to find genetic variation between the newly sequenced DNA fragments and the reference genome(s). One way to overcome this challenge is to develop new algorithms and data structures for short read alignment that reduce the computational resources required. This project realizes this vision by developing a novel representation of a population of genomes and creating the algorithms and data structures needed to build, store, and update it. Integrated into this project are the goals of advancing biological science and knowledge of model species and of furthering the development of an outreach program that supports first-generation university graduates. An immediate outcome of the work will be research opportunities for under-served students through the Machen Florida Opportunity Scholars program, an organization that aims to foster the success of first-generation university scholars.
Short read aligners first build an index from one or more reference genomes and subsequently use it to find and extend matched subsequences between sequence reads and the reference(s). The bottleneck in using these read aligners to index thousands of genomes is the space and time needed to construct and store the index. To address the shortcomings of using a single reference genome, the concept of graph-based pangenomics aligners has been introduced and widely discussed in the community. Although such methods have been shown to improve accuracy over standard sequence-based aligners, their use has not been fully explored. The challenge that prevents the realization of pangenomics graph alignment is scalability. The goal of the project is to develop algorithms that allow a pangenomic reference to be constructed from datasets gathered from large populations. To achieve this goal, novel means to build, compress, and update a graph that encapsulates the variation found in the population will be created and implemented. This work will require further advancements that have impact beyond the stated application. More specifically, it is unknown how to merge r-indexes, how to represent a graph model of references in sub-linear space, or how to represent the graph using the r-index. This project will address these open problems and, more broadly, connect two areas of research: succinct data structures and pangenomics. Next, the project will narrow the conceptual gap between compression and mutability. The research community has struggled with the balance between the two, since highly compressed data structures cannot be altered without reconstruction. This poses undue constraints when trying to apply these structures to biological datasets that are routinely updated with new data. This project will make significant developments in this area by developing compressed data structures that are mutable, for use in the project's pangenomics index. Project website: www.christinaboucher.com/pangenomics-iibr
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
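The abstract refers to building an index over reference genomes, matching reads against it, and to the r-index, whose size depends on the number of runs r in the Burrows-Wheeler transform (BWT) rather than on the text length n. The following is a minimal Python sketch of those ideas, not the project's implementation: the function names (`bwt`, `run_count`, `count_occurrences`, `demo_collection`) and the toy data are assumptions made for illustration, the BWT is built by naive rotation sorting, and the rank queries are linear scans where a real index would use compressed rank structures.

```python
import random
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive; illustration only)."""
    assert sentinel not in text
    s = text + sentinel  # "$" sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_count(s: str) -> int:
    """Number of maximal runs of equal characters (the r in 'r-index')."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the indexed text by backward search."""
    counts = Counter(bwt_str)
    first = {}  # first[c] = number of characters in the text smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; a linear scan, for clarity only.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


def demo_collection(genome_len: int = 200, copies: int = 8, seed: int = 0):
    """Build a toy 'population': copies of one random genome, each with one substitution."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(genome_len))
    genomes = []
    for _ in range(copies):
        g = list(base)
        pos = rng.randrange(genome_len)
        g[pos] = rng.choice([b for b in "ACGT" if b != g[pos]])
        genomes.append("".join(g))
    one, many = genomes[0], "".join(genomes)
    return len(one), run_count(bwt(one)), len(many), run_count(bwt(many))


if __name__ == "__main__":
    n_one, r_one, n_many, r_many = demo_collection()
    print(f"one genome:   n={n_one}, r={r_one}")
    print(f"eight copies: n={n_many}, r={r_many}")
    print("occurrences of 'ana' in 'banana':",
          count_occurrences(bwt("banana"), "ana"))
```

On this toy collection, n grows with the number of genomes while r grows much more slowly, because near-identical genomes place equal characters next to each other in the BWT. That gap is the reason a run-length-based index is attractive for large populations of similar genomes.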

Project outcomes

Journal articles (19)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Computational graph pangenomics: a tutorial on data structures and their applications.
  • DOI:
    10.1007/s11047-022-09882-6
  • Published:
    2022-03
  • Journal:
  • Impact factor:
    2.1
  • Authors:
    Baaijens, Jasmijn A.;Bonizzoni, Paola;Boucher, Christina;Della Vedova, Gianluca;Pirola, Yuri;Rizzi, Raffaella;Siren, Jouni
  • Corresponding author:
    Siren, Jouni
Compressing and Indexing Aligned Readsets
Efficiently Merging r-indexes
  • DOI:
    10.1109/dcc50243.2021.00028
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Oliva, Marco;Rossi, Massimiliano;Siren, Jouni;Manzini, Giovanni;Kahveci, Tamer;Gagie, Travis;Boucher, Christina
  • Corresponding author:
    Boucher, Christina
A Fast and Small Subsampled R-Index
More Time-Space Tradeoffs for Finding a Shortest Unique Substring
  • DOI:
    10.3390/a13090234
  • Published:
    2020
  • Journal:
  • Impact factor:
    2.3
  • Authors:
    Bannai, Hideo;Gagie, Travis;Hoppenworth, Gary;Puglisi, Simon J.;Russo, Luís M.
  • Corresponding author:
    Russo, Luís M.

Other publications by Christina Boucher

Data Structures for SMEM-Finding in the PBWT
  • DOI:
    10.1007/978-3-031-43980-3_8
  • Published:
    2023
  • Journal:
  • Impact factor:
    5.4
  • Authors:
    Paola Bonizzoni;Christina Boucher;D. Cozzi;Travis Gagie;Dominik Köppl;Massimiliano Rossi
  • Corresponding author:
    Massimiliano Rossi
Solving the Minimal Positional Substring Cover Problem in Sublinear Space
ONeSAMP 3.0: Effective Population Size via SNP Data for One Population Sample
  • DOI:
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Aaron Hong;R. G. Cheek;Kingshuk Mukherjee;Isha Yooseph;Marco Oliva;Mark Heim;W. C. Funk;David Tallmon;Christina Boucher
  • Corresponding author:
    Christina Boucher
Parametric and nonparametric probability distribution estimators of sample maximum
  • DOI:
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Christina Boucher;Travis Gagie;Tomohiro I;Dominik Koeppl;Ben Langmead;Giovanni Manzini;Gonzalo Navarro;Alejandro Pacheco;Massimiliano Rossi;Moriyama Taku
  • Corresponding author:
    Moriyama Taku
Cliffy: robust 16S rRNA classification based on a compressed LCA index
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Omar Ahmed;Christina Boucher;Ben Langmead
  • Corresponding author:
    Ben Langmead

Other grants held by Christina Boucher

Collaborative Research: EAGER: Solving the bait learning problem for large-scale DNA enrichment
  • Award number:
    2118251
  • Fiscal year:
    2021
  • Funding amount:
    $700,400
  • Project type:
    Standard Grant
SCH: INT: Enabling real time surveillance of antimicrobial resistance
  • Award number:
    2013998
  • Fiscal year:
    2021
  • Funding amount:
    $700,400
  • Project type:
    Standard Grant
III: Small: Collaborative Research: A Scalable and Efficient Optical Map Assembler
  • Award number:
    1618814
  • Fiscal year:
    2016
  • Funding amount:
    $700,400
  • Project type:
    Standard Grant

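Similar grants from the National Natural Science Foundation of China

2023 (4th) International Symposium on Biomathematics and Medical Applications
  • Award number:
    12342004
  • Approval year:
    2023
  • Funding amount:
    CNY 80,000
  • Project type:
    Special Project
Bioinformatics study of how mutations and modifications reshape protein subcellular localization
  • Award number:
    32370698
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
Knowledge-guided and data-driven informatics models and applications for identifying key regulatory signaling pathways in intrahepatic cholangiocarcinoma
  • Award number:
    32370694
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
Construction and validation of a bioinformatics-based frailty prediction model for patients with rheumatoid arthritis
  • Award number:
    82301786
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Bioinformatics methods for predicting interactions between proteins and long non-coding RNAs based on structural representations
  • Award number:
    62373216
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program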

Similar overseas grants

Risk stratifying indeterminate pulmonary nodules with jointly learned features from longitudinal radiologic and clinical big data
  • Award number:
    10678264
  • Fiscal year:
    2023
  • Funding amount:
    $700,400
  • Project type:
AppalTRuST Community Outreach and Participant Engagement Core
  • Award number:
    10665325
  • Fiscal year:
    2023
  • Funding amount:
    $700,400
  • Project type:
Appalachian Tobacco Regulatory Science Team (AppalTRuST)
  • Award number:
    10665319
  • Fiscal year:
    2023
  • Funding amount:
    $700,400
  • Project type:
A multi-modal approach for efficient, point-of-care screening of hypertrophic cardiomyopathy
  • Award number:
    10749588
  • Fiscal year:
    2023
  • Funding amount:
    $700,400
  • Project type:
By Youth, For Youth: Digital Supported Peer Navigation for Addressing Child Mental Health Inequities
  • Award number:
    10414497
  • Fiscal year:
    2022
  • Funding amount:
    $700,400
  • Project type: