IIBR Informatics: An Efficient Pangenomics Graph Aligner

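Basic Information

Award number:
2029552
Principal investigator:
Christina Boucher
Amount:
approx. $700,400
Host institution:
University of Florida
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Period:
2020-09-01 to 2024-08-31
Status:
Completed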

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2029552&HistoricalAwards=false
Keywords:
IIBR Informatics Efficient Pangenomics Graph

Project Abstract

In the past decade, there has been an effort to sequence and compare the DNA of a large number of individuals of a given species, resulting in not just a single reference genome but a population of genomes for that species. Enormous public datasets are now available, including the 1000 Genomes Project, the 100K Genome Project, the 1001 Arabidopsis Genomes project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Key software tools, called short read aligners, align newly sequenced DNA fragments to one or more reference genomes in order to identify genetic variation within the species. Downstream analysis of this genetic variation finds causal relationships between complex diseases and phenotypes. Existing short read aligners are unable to align to a large number of reference genomes, purely because of computational constraints, so in practice only a small number of genomes is used to keep memory and time requirements manageable. Unfortunately, although individuals of the same species share a large percentage of their genetic material, the differences are also important, and aligning to only a small number of genomes can leave some DNA fragments unaligned or poorly aligned. This, in turn, makes finding genetic variation between the newly sequenced DNA fragments and the reference genome(s) more challenging. One way to overcome this challenge is to develop new algorithms and data structures for short read alignment that reduce the computational resources required. This project realizes this vision by developing a novel representation of a population of genomes and creating the algorithms and data structures needed to build, store, and update it. Integrated into this project are the goals of advancing biological science and knowledge of model species and furthering the development of an outreach program that supports first-generation university graduates. An immediate outcome of the work will be research opportunities for under-served students through the Machen Florida Opportunity Scholars program, an organization that aims to foster the success of first-generation university scholars.
Short read aligners first build an index from one or more reference genomes and subsequently use it to find and extend matched subsequences between sequence reads and the reference(s). The bottleneck in using these read aligners to index thousands of genomes is the space and time needed to construct and store the index. To address the shortcomings associated with using a single reference genome, the concept of graph-based pangenomics aligners has been introduced and widely discussed in the community. Although such methods have been shown to improve on the accuracy of standard sequence-based aligners, their use has not been fully explored. The challenge that prevents the realization of pangenomics graph alignment is scalability. The goal of the project is to develop algorithms that allow for the construction of a pangenomic reference from datasets gathered from large populations. To achieve this goal, novel means to build, compress, and update a graph that encapsulates the variation found in the population will be created and implemented. This work will require further advancements that have impact beyond the stated application. More specifically, it is unknown how to merge r-indexes, represent a graph model of references using sub-linear space, or represent the graph using the r-index. This project will address these open problems and, more broadly, connect two areas of research: succinct data structures and pangenomics. Next, the project will narrow the conceptual gap between compression and mutability. The research community has struggled with the balance between compression and mutability, since highly compressed data structures cannot be altered without reconstruction. This poses undue constraints when trying to apply these structures to biological datasets that are routinely updated with new data. This project will make significant developments in this area by developing mutable compressed data structures for its realization of the pangenomics index. Project website: www.christinaboucher.com/pangenomics-iibr
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
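
The abstract's interest in the r-index and sub-linear space can be made concrete with a toy experiment. The r-index is generally described as using space proportional to r, the number of runs of equal characters in the Burrows-Wheeler transform (BWT) of the indexed text, rather than to the text length n. The sketch below is only an illustration under stated assumptions, not the project's method: the function names (`bwt`, `count_runs`, `substitute`), the random 200-base "genome", and the single-substitution variants are all hypothetical, genomes are concatenated without separators for simplicity, and the BWT is built by naive rotation sorting that would not scale to real data.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # '$' terminator: assumed absent from text, sorts before A/C/G/T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Count maximal runs of equal characters, the quantity usually called r."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def substitute(seq: str, pos: int) -> str:
    """Return seq with the base at pos swapped for the next one in ACGT order."""
    new_base = "ACGT"[("ACGT".index(seq[pos]) + 1) % 4]
    return seq[:pos] + new_base + seq[pos + 1:]


# Hypothetical toy data: one random 200-base "genome" plus 19 copies that
# each differ from it by a single substitution.
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(200))
genomes = [base] + [substitute(base, (17 * k) % len(base)) for k in range(1, 20)]
collection = "".join(genomes)

r_one = count_runs(bwt(base))
r_many = count_runs(bwt(collection))
print(f"one genome:  n = {len(base)}, r = {r_one}")
print(f"20 genomes:  n = {len(collection)}, r = {r_many}")
```

Running the script prints n and r for a single genome and for the 20-genome collection. On repetitive input like this, r should grow much more slowly than n, which is the intuition behind indexing many similar genomes in space tied to r.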

Project Outputs

Journal articles (19)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Computational graph pangenomics: a tutorial on data structures and their applications.

DOI:
10.1007/s11047-022-09882-6
Published:
2022-03
Journal:
NATURAL COMPUTING
Impact factor:
2.1
Authors:
Baaijens, Jasmijn A.;Bonizzoni, Paola;Boucher, Christina;Della Vedova, Gianluca;Pirola, Yuri;Rizzi, Raffaella;Siren, Jouni
Corresponding author:
Siren, Jouni

Compressing and Indexing Aligned Readsets

DOI:
10.4230/lipics.wabi.2021.13
Published:
2021
Journal:
Workshop on Algorithms in Bioinformatics (WABI)
Impact factor:
0
Authors:
Gagie, Travis;Gourdel, Garance;Manzini, Giovanni
Corresponding author:
Manzini, Giovanni

Efficiently Merging r-indexes

DOI:
10.1109/dcc50243.2021.00028
Published:
2021
Journal:
2021 Data Compression Conference (DCC)
Impact factor:
0
Authors:
Oliva, Marco;Rossi, Massimiliano;Siren, Jouni;Manzini, Giovanni;Kahveci, Tamer;Gagie, Travis;Boucher, Christina
Corresponding author:
Boucher, Christina

A Fast and Small Subsampled R-Index

DOI:
10.4230/lipics.cpm.2021.13
Published:
2021
Journal:
Leibniz International Proceedings in Informatics
Impact factor:
0
Authors:
Cobas, Dustin;Gagie, Travis;Navarro, Gonzalo
Corresponding author:
Navarro, Gonzalo

More Time-Space Tradeoffs for Finding a Shortest Unique Substring

DOI:
10.3390/a13090234
Published:
2020
Journal:
Algorithms
Impact factor:
2.3
Authors:
Bannai, Hideo;Gagie, Travis;Hoppenworth, Gary;Puglisi, Simon J.;Russo, Luís M.
Corresponding author:
Russo, Luís M.

Other publications by Christina Boucher

Data Structures for SMEM-Finding in the PBWT

DOI:
10.1007/978-3-031-43980-3_8
Published:
2023
Journal:
Theoretical and Applied Genetics
Impact factor:
5.4
Authors:
Paola Bonizzoni;Christina Boucher;D. Cozzi;Travis Gagie;Dominik Köppl;Massimiliano Rossi
Corresponding author:
Massimiliano Rossi

Solving the Minimal Positional Substring Cover Problem in Sublinear Space

DOI:
Published:
2024
Journal:
Annual Symposium on Combinatorial Pattern Matching
Impact factor:
0
Authors:
Paola Bonizzoni;Christina Boucher;D. Cozzi;Travis Gagie;Yuri Pirola
Corresponding author:
Yuri Pirola

ONeSAMP 3.0: Effective Population Size via SNP Data for One Population Sample

DOI:
Published:
2023
Journal:
bioRxiv
Impact factor:
0
Authors:
Aaron Hong;R. G. Cheek;Kingshuk Mukherjee;Isha Yooseph;Marco Oliva;Mark Heim;W. C. Funk;David Tallmon;Christina Boucher
Corresponding author:
Christina Boucher

Parametric and nonparametric probability distribution estimators of sample maximum

DOI:
Published:
2021
Journal:
arXiv
Impact factor:
0
Authors:
Christina Boucher;Travis Gagie;Tomohiro I;Dominik Koeppl;Ben Langmead;Giovanni Manzini;Gonzalo Navarro;Alejandro Pacheco;Massimiliano Rossi;Moriyama Taku
Corresponding author:
Moriyama Taku