Read-to-contig alignments for de novo genome assembly and annotation

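Basic information

  • Grant number:
    RGPIN-2014-05112
  • Principal investigator:
    Birol, Inanc
  • Amount:
    $25,700
  • Host institution:
  • Host institution country:
    Canada
  • Program:
    Discovery Grants Program - Individual
  • Fiscal year:
    2015
  • Funding country:
    Canada
  • Project period:
    2015-01-01 to 2016-12-31
  • Project status:
    Completed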

Project abstract

The proposed research is about building computational technologies to analyze DNA. DNA is composed of sequences of four possible nucleotides (nt): A, C, G and T. The last decade witnessed a revolution in technologies that “read” DNA sequences, with applications in many areas of life sciences.
Data from high-throughput sequencing (HTS) platforms reach hundreds of millions of “reads”, where each read represents 75-300 nt of DNA (for comparison, the human genome, the sum total of our DNA, is around 3 billion nt long). Interpreting these massive volumes of short reads is an ongoing challenge as sequencing technologies evolve.
There are two popular approaches to analyzing HTS reads: alignment-based and assembly-based. The first uses a reference genome, whose DNA sequence is known from previous studies of the same or a closely related species. In this approach, reads are aligned to the reference genome through a process that searches for sequence similarities between the reads and the reference. The second approach is a data-driven method that does not assume similarity to any given genome. Instead, it reconstructs the genome represented by the DNA de novo (from scratch). This is a less biased approach that gives a truer representation of the genome, especially if there have been rearrangements compared to the reference genome sequence, or if no reference is available. The Birol lab has developed de novo assembly algorithms and downstream analysis tools and has applied them in a number of highly visible projects in human health and other fields. In the proposed work, the team will concentrate on alignment technologies as a way to support this highly successful assembly-based analysis platform.
The read alignment problem has been addressed several times, to match changes in read lengths and data volumes as HTS technology evolved. However, efficient and accurate alignment of reads to newly assembled genomes remains an unmet need. General-purpose read alignment algorithms assume the target sequence to be composed of a small number of long stretches of sequence, essentially chromosomes. The results of draft de novo assembly processes, in contrast, are typically in hundreds of thousands of pieces. This creates problems for general-purpose aligners, which we will address by developing an algorithm for this specific need. We will pay special attention to the scalability of our algorithm to accommodate the growing volume of data, and we will achieve this by building parallel processing algorithms similar to those used in Internet search engines, such as Google.
When the genome of a new species is sequenced and assembled, one important task is to “annotate” its genes, i.e., to mark where they are in the genome and how they are structured. We also note an important gap in this area: current alignment technologies were developed for previous generations of sequencing platforms and have reached their limits in supporting data from new sequencing projects. (One such popular tool, exonerate, is still heavily used, yet it is no longer maintained by the lab that developed it.) We propose to build an alternative to these tools, and provide sustained support for the community.
As the use of sequencing technologies further penetrates life sciences, there is an urgent need for high-quality computational tools to analyze large volumes of data in a timely manner. Development of the described alignment technologies will improve the efficiency and the accuracy of de novo assemblies and their annotation.
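As a rough illustration of the read-to-contig problem the abstract describes, the sketch below indexes every k-mer of a set of contigs and places a read by letting its k-mers vote for a contig and start offset. It is a toy under stated assumptions, not the algorithm proposed in this project: the function names `build_kmer_index` and `best_placement`, the example contigs, and the choice of k = 5 are all hypothetical, and the sketch handles exact forward-strand matches only (no mismatches, indels, or reverse complements).

```python
from collections import Counter, defaultdict


def build_kmer_index(contigs, k):
    """Map each k-mer to the (contig_id, offset) positions where it occurs.

    Hypothetical helper for illustration; `contigs` is a dict of id -> sequence.
    """
    index = defaultdict(list)
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((contig_id, i))
    return index


def best_placement(read, index, k):
    """Return (contig_id, start, votes) for the placement with most k-mer support.

    Each exact k-mer hit at read offset j and contig offset i votes for the
    read starting at contig position i - j. Returns None if no k-mer matches.
    """
    votes = Counter()
    for j in range(len(read) - k + 1):
        for contig_id, i in index.get(read[j:j + k], ()):
            votes[(contig_id, i - j)] += 1
    if not votes:
        return None
    (contig_id, start), count = votes.most_common(1)[0]
    return contig_id, start, count


# Toy data (assumed for the example): two short contigs and one read.
contigs = {
    "contig_1": "ACGTTGCATGCCGATAGCTA",
    "contig_2": "TTGACCGTAGGCTAACGGAT",
}
index = build_kmer_index(contigs, k=5)
placement = best_placement("CCGTAGGCTA", index, k=5)
print(placement)
```

Because the index is keyed by k-mer rather than by target sequence, the lookup cost per read does not depend on whether the target is a few chromosomes or hundreds of thousands of contigs; only the bookkeeping of contig identifiers grows. A practical aligner would add seed extension, error tolerance, and a far more compact index.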

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Birol, Inanc

Theoretical Analysis of the Minimum Sum of Squared Similarities Sampling for Nystrom-Based Spectral Clustering
Antimicrobial peptides from Rana [Lithobates] catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles
  • DOI:
    10.1038/s41598-018-38442-1
  • Published:
    2019-02-06
  • Journal:
  • Impact factor:
    4.6
  • Authors:
    Helbing, Caren C.;Hammond, S. Austin;Birol, Inanc
  • Corresponding author:
    Birol, Inanc
Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma.
  • DOI:
    10.1038/nature10351
  • Published:
    2011-07-27
  • Journal:
  • Impact factor:
    64.8
  • Authors:
    Morin, Ryan D.;Mendez-Lago, Maria;Mungall, Andrew J.;Goya, Rodrigo;Mungall, Karen L.;Corbett, Richard D.;Johnson, Nathalie A.;Severson, Tesa M.;Chiu, Readman;Field, Matthew;Jackman, Shaun;Krzywinski, Martin;Scott, David W.;Trinh, Diane L.;Tamura-Wells, Jessica;Li, Sa;Firme, Marlo R.;Rogic, Sanja;Griffith, Malachi;Chan, Susanna;Yakovenko, Oleksandr;Meyer, Irmtraud M.;Zhao, Eric Y.;Smailus, Duane;Moksa, Michelle;Chittaranjan, Suganthi;Rimsza, Lisa;Brooks-Wilson, Angela;Spinelli, John J.;Ben-Neriah, Susana;Meissner, Barbara;Woolcock, Bruce;Boyle, Merrill;McDonald, Helen;Tam, Angela;Zhao, Yongjun;Delaney, Allen;Zeng, Thomas;Tse, Kane;Butterfield, Yaron;Birol, Inanc;Holt, Rob;Schein, Jacqueline;Horsman, Douglas E.;Moore, Richard;Jones, Steven J. M.;Connors, Joseph M.;Hirst, Martin;Gascoyne, Randy D.;Marra, Marco A.
  • Corresponding author:
    Marra, Marco A.
Linear time complexity de novo long read genome assembly with GoldRush.
  • DOI:
    10.1038/s41467-023-38716-x
  • Published:
    2023-05-22
  • Journal:
  • Impact factor:
    16.6
  • Authors:
    Wong, Johnathan;Coombe, Lauren;Nikolic, Vladimir;Zhang, Emily;Nip, Ka Ming;Sidhu, Puneet;Warren, Rene L.;Birol, Inanc
  • Corresponding author:
    Birol, Inanc
Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response.
  • DOI:
    10.1093/g3journal/jkad247
  • Published:
    2023-12-29
  • Journal:
  • Impact factor:
    2.6
  • Authors:
    Lo, Theodora;Coombe, Lauren;Gagalova, Kristina K.;Marr, Alex;Warren, Rene L.;Kirk, Heather;Pandoh, Pawan;Zhao, Yongjun;Moore, Richard A.;Mungall, Andrew J.;Ritland, Carol;Pavy, Nathalie;Jones, Steven J. M.;Bohlmann, Joerg;Bousquet, Jean;Birol, Inanc;Thomson, Ashley
  • Corresponding author:
    Thomson, Ashley

Other grants held by Birol, Inanc

Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
  • Grant number:
    RGPIN-2019-06640
  • Fiscal year:
    2022
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
  • Grant number:
    RGPIN-2019-06640
  • Fiscal year:
    2021
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
  • Grant number:
    RGPIN-2019-06640
  • Fiscal year:
    2020
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Novel Data Structures And Scalable Algorithms For High Throughput Bioinformatics
  • Grant number:
    RGPIN-2019-06640
  • Fiscal year:
    2019
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
  • Grant number:
    RGPIN-2014-05112
  • Fiscal year:
    2018
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
  • Grant number:
    RGPIN-2014-05112
  • Fiscal year:
    2017
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
  • Grant number:
    RGPIN-2014-05112
  • Fiscal year:
    2016
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual
Read-to-contig alignments for de novo genome assembly and annotation
  • Grant number:
    RGPIN-2014-05112
  • Fiscal year:
    2014
  • Funding amount:
    $25,700
  • Program:
    Discovery Grants Program - Individual

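Similar grants from the National Natural Science Foundation of China

Research on the contig-based genome scaffold filling problem
  • Grant number:
    61902221
  • Year approved:
    2019
  • Funding amount:
    CNY 240,000
  • Program:
    Young Scientists Fund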

Similar overseas grants

Software for the complete characterization of antibody repertoires: from germline and mRNA sequence assembly to deep learning predictions of their protein structures and targets
  • Grant number:
    10699546
  • Fiscal year:
    2023
  • Funding amount:
    $25,700
  • Program:
Pangenomics of nicotine abuse in the hybrid rat diversity panel
  • Grant number:
    10582448
  • Fiscal year:
    2023
  • Funding amount:
    $25,700
  • Program:
Novel bioinformatics methods for integrative detection of structural variants from long-read sequencing
  • Grant number:
    10752265
  • Fiscal year:
    2023
  • Funding amount:
    $25,700
  • Program:
Revealing new short tandem repeat variation in the human population across sequencing technologies: towards rare disease diagnosis and discovery
  • Grant number:
    10572951
  • Fiscal year:
    2023
  • Funding amount:
    $25,700
  • Program:
Development of high-quality reference genomes for Anopheles squamosus and An. cydippis
  • Grant number:
    10725180
  • Fiscal year:
    2023
  • Funding amount:
    $25,700
  • Program: