Read-to-contig alignments for de novo genome assembly and annotation
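Basic information
- Grant number: RGPIN-2014-05112
- Principal investigator:
- Amount: $25,700
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2015
- Funding country: Canada
- Duration: 2015-01-01 to 2016-12-31
- Status: Completed
- Source:
- Keywords:
Project abstract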
The proposed research is about building computational technologies to analyze DNA. DNA is composed of sequences of four possible nucleotides (nt): A, C, G and T. The last decade witnessed a revolution in technologies that “read” DNA sequences, with applications in many areas of life sciences.
Data from high throughput sequencing (HTS) platforms reach hundreds of millions of “reads”, where each read represents 75-300 nt of DNA (the human genome – the sum total of our DNA – is around 3 billion nt long). Interpreting these massive volumes of short reads is an ongoing challenge as sequencing technologies evolve.
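To put these volumes in perspective, the expected sequencing depth (coverage) can be estimated as the number of reads times the read length, divided by the genome size. The short sketch below works through that arithmetic; the function name and the specific read count and read length are illustrative assumptions chosen from within the ranges quoted above, not figures from the project.

```python
def expected_coverage(num_reads, read_length_nt, genome_size_nt):
    """Average number of times each genome position is covered by a read."""
    return num_reads * read_length_nt / genome_size_nt


# Assumed example values: 300 million reads of 100 nt each,
# sequenced from a genome of about 3 billion nt.
coverage = expected_coverage(300_000_000, 100, 3_000_000_000)
print(f"{coverage:.1f}x")
```

Under these assumed values each position of the genome is covered about ten times on average, which is why the reads must be aligned or assembled rather than interpreted one by one.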
There are two popular analysis methods that process HTS reads: alignment-based and assembly-based approaches. The first uses a reference genome, whose DNA sequence is known from previous studies of the same or a closely related species. In this approach, reads are aligned to the reference genome through a process that searches for sequence similarities between the reads and the reference. The second approach is a data-driven method that does not assume similarity to any given genome. Instead, it reconstructs the genome represented by the DNA de novo (from scratch). This is a less biased approach that gives a truer representation of the genome, especially if there have been rearrangements compared to the reference genome sequence, or if no reference is available.
The Birol lab has developed de novo assembly algorithms and downstream analysis tools and has applied them in a number of highly visible projects in human health and other fields. In the proposed work, the team will concentrate on alignment technologies as a way to support this highly successful assembly-based analysis platform.
The read alignment problem has been addressed several times, to match changes in read lengths and data volumes as HTS technology evolved. However, efficient and accurate alignment of reads to newly assembled genomes remains an unmet need. General-purpose read alignment algorithms assume the target sequence to be composed of a small number of long stretches of sequence, essentially chromosomes. The results of draft de novo assembly processes, in contrast, typically consist of hundreds of thousands of pieces. This creates problems for general-purpose aligners, which we will address by developing an algorithm for this specific need. We will pay special attention to the scalability of our algorithm to accommodate the growing volume of data, and we will achieve this by building parallel processing algorithms similar to those used in Internet search engines, such as Google.
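As a rough illustration of what aligning reads to a large collection of contigs involves, the sketch below builds a k-mer index over several contigs and places a read by letting its k-mers vote for a (contig, start position) pair. This is a generic seed-and-vote approach, not the algorithm proposed in this project; the function names `build_kmer_index` and `place_read`, the toy sequences, and the choice of k = 5 are all assumptions made for illustration. It handles only exact k-mer matches on the forward strand.

```python
from collections import Counter, defaultdict


def build_kmer_index(contigs, k):
    """Map each k-mer to every (contig_id, offset) at which it occurs."""
    index = defaultdict(list)
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((contig_id, i))
    return index


def place_read(read, index, k):
    """Return (contig_id, start, votes) for the best-supported placement.

    Each k-mer of the read votes for the contig position at which the read
    would have to start for that k-mer to match. Returns None if no k-mer
    of the read occurs in any contig.
    """
    votes = Counter()
    for j in range(len(read) - k + 1):
        for contig_id, i in index.get(read[j:j + k], ()):
            votes[(contig_id, i - j)] += 1
    if not votes:
        return None
    (contig_id, start), count = votes.most_common(1)[0]
    return contig_id, start, count


# Toy data (hypothetical): two short contigs and one read taken from contig_2.
contigs = {
    "contig_1": "ACGTACGTTAGCCGATAGGCTTAACG",
    "contig_2": "TTGACCGGATATCCGGTCAAGGCATG",
}
index = build_kmer_index(contigs, k=5)
hit = place_read("GGATATCCGG", index, k=5)
print(hit)
```

Because each contig contributes to the index independently, the indexing step could in principle be split across workers by contig and the per-read lookups by read, which is one way to read the search-engine analogy in the abstract. That parallel layer is not shown here.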
When the genome of a new species is sequenced and assembled, one important task is to “annotate” its genes, i.e., to mark where they are in the genome and how they are structured. We also note an important gap in this area: current alignment technologies were developed for previous generations of sequencing platforms and have reached their limits in supporting data from new sequencing projects. (One such popular tool, exonerate, is still heavily used, yet it is no longer maintained by the developer lab.) We propose to build an alternative to these tools and to provide sustained support for the community.
As the use of sequencing technologies further penetrates life sciences, there is an urgent need for high-quality computational tools to analyze large volumes of data in a timely manner. Development of the described alignment technologies will improve the efficiency and the accuracy of de novo assemblies and their annotation.
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Birol, Inanc
Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response.
- DOI: 10.1093/g3journal/jkad247
- Published: 2023-12-29
- Journal:
- Impact factor: 2.6
- Authors: Lo, Theodora; Coombe, Lauren; Gagalova, Kristina K.; Marr, Alex; Warren, Rene L.; Kirk, Heather; Pandoh, Pawan; Zhao, Yongjun; Moore, Richard A.; Mungall, Andrew J.; Ritland, Carol; Pavy, Nathalie; Jones, Steven J. M.; Bohlmann, Joerg; Bousquet, Jean; Birol, Inanc; Thomson, Ashley
- Corresponding author: Thomson, Ashley
Linear time complexity de novo long read genome assembly with GoldRush.
- DOI: 10.1038/s41467-023-38716-x
- Published: 2023-05-22
- Journal:
- Impact factor: 16.6
- Authors: Wong, Johnathan; Coombe, Lauren; Nikolic, Vladimir; Zhang, Emily; Nip, Ka Ming; Sidhu, Puneet; Warren, Rene L.; Birol, Inanc
- Corresponding author: Birol, Inanc
Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma.
- DOI: 10.1038/nature10351
- Published: 2011-07-27
- Journal:
- Impact factor: 64.8
- Authors: Morin, Ryan D.; Mendez-Lago, Maria; Mungall, Andrew J.; Goya, Rodrigo; Mungall, Karen L.; Corbett, Richard D.; Johnson, Nathalie A.; Severson, Tesa M.; Chiu, Readman; Field, Matthew; Jackman, Shaun; Krzywinski, Martin; Scott, David W.; Trinh, Diane L.; Tamura-Wells, Jessica; Li, Sa; Firme, Marlo R.; Rogic, Sanja; Griffith, Malachi; Chan, Susanna; Yakovenko, Oleksandr; Meyer, Irmtraud M.; Zhao, Eric Y.; Smailus, Duane; Moksa, Michelle; Chittaranjan, Suganthi; Rimsza, Lisa; Brooks-Wilson, Angela; Spinelli, John J.; Ben-Neriah, Susana; Meissner, Barbara; Woolcock, Bruce; Boyle, Merrill; McDonald, Helen; Tam, Angela; Zhao, Yongjun; Delaney, Allen; Zeng, Thomas; Tse, Kane; Butterfield, Yaron; Birol, Inanc; Holt, Rob; Schein, Jacqueline; Horsman, Douglas E.; Moore, Richard; Jones, Steven J. M.; Connors, Joseph M.; Hirst, Martin; Gascoyne, Randy D.; Marra, Marco A.
- Corresponding author: Marra, Marco A.
Antimicrobial peptides from Rana [Lithobates] catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles
- DOI: 10.1038/s41598-018-38442-1
- Published: 2019-02-06
- Journal:
- Impact factor: 4.6
- Authors: Helbing, Caren C.; Hammond, S. Austin; Birol, Inanc
- Corresponding author: Birol, Inanc
Theoretical Analysis of the Minimum Sum of Squared Similarities Sampling for Nystrom-Based Spectral Clustering
- DOI: 10.1109/ijcnn.2016.7727698
- Published: 2016-01-01
- Journal:
- Impact factor: 0
- Authors: Bouneffouf, Djallel; Birol, Inanc
- Corresponding author: Birol, Inanc
Other grants held by Birol, Inanc
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
- Grant number: RGPIN-2019-06640
- Fiscal year: 2022
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
- Grant number: RGPIN-2019-06640
- Fiscal year: 2021
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
- Grant number: RGPIN-2019-06640
- Fiscal year: 2020
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
- Grant number: RGPIN-2019-06640
- Fiscal year: 2019
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
- Grant number: RGPIN-2014-05112
- Fiscal year: 2018
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
- Grant number: RGPIN-2014-05112
- Fiscal year: 2017
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
- Grant number: RGPIN-2014-05112
- Fiscal year: 2016
- Amount: $25,700
- Program: Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
- Grant number: RGPIN-2014-05112
- Fiscal year: 2014
- Amount: $25,700
- Program: Discovery Grants Program - Individual
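Similar grants from the National Natural Science Foundation of China
Research on the genome scaffold filling problem based on contigs
- Grant number: 61902221
- Approval year: 2019
- Amount: CNY 240,000
- Program: Young Scientists Fund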
Similar overseas grants
Software for the complete characterization of antibody repertoires: from germline and mRNA sequence assembly to deep learning predictions of their protein structures and targets
- Grant number: 10699546
- Fiscal year: 2023
- Amount: $25,700
- Program:
Pangenomics of nicotine abuse in the hybrid rat diversity panel
- Grant number: 10582448
- Fiscal year: 2023
- Amount: $25,700
- Program:
Novel bioinformatics methods for integrative detection of structural variants from long-read sequencing
- Grant number: 10752265
- Fiscal year: 2023
- Amount: $25,700
- Program:
Revealing new short tandem repeat variation in the human population across sequencing technologies: towards rare disease diagnosis and discovery
- Grant number: 10572951
- Fiscal year: 2023
- Amount: $25,700
- Program:
Development of high-quality reference genomes for Anopheles squamosus and An. cydippis
- Grant number: 10725180
- Fiscal year: 2023
- Amount: $25,700
- Program: