III: Small: Collaborative Research: Supporting Efficient Discrete Box Queries for Sequence Analysis on Large Scale Genome Databases

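Basic Information

  • Award number:
    1319909
  • Principal investigator:
  • Amount:
    $273,400
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Period:
    2013-09-01 to 2018-08-31
  • Status:
    Completed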

Project Abstract

This collaborative research project, conducted jointly by investigators from Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates issues and techniques for storing and querying large-scale k-mer data sets (i.e., overlapping k-length subsequences obtained from genome sequences) for sequence analysis in bioinformatics. Efficient k-mer indexing, storage, and retrieval are vital to sequence analysis tasks such as error correction as sequencing data sets grow rapidly. Most existing methods for storing and searching k-mers are optimized for exact or range queries, and this reliance limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods do not support efficient storage of k-mers at multiple word lengths. For many sequence analysis problems, including error correction, variant detection, and assembly, searches with multiple word lengths enable better sensitivity and specificity. This project investigates techniques for efficiently supporting so-called (discrete) box queries and related queries (e.g., hybrid queries) on large-scale k-mer data sets. It examines approaches to optimizing box queries for sequence analysis problems such as error correction, and explores storage structures and the use of box queries to support searches with multiple word lengths on k-mer data sets. The results of this research will advance the state of knowledge in storage, indexing, and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making available new, efficient on-disk solutions for sequence analysis. They will also impact a number of other popular application areas, including biometrics, image processing, social networks, and e-commerce, where processing non-ordered discrete multidimensional data is crucial.
This collaborative research project, conducted jointly by investigators from Michigan State University (MSU) and the University of Michigan at Dearborn (UM-D), investigates issues and techniques for storing and querying large-scale k-mer data sets for sequence analysis in bioinformatics. Efficient k-mer indexing, storage, and retrieval are vital to sequence analysis tasks such as error correction as sequencing data sets grow rapidly. Most existing methods for storing and searching k-mers are optimized for exact or range queries, and this reliance limits the types of sequence analysis that can be done efficiently. Moreover, most existing methods do not support efficient storage of k-mers at multiple word lengths. For many sequence analysis problems, searches with multiple word lengths enable better sensitivity and specificity. This project investigates techniques for efficiently supporting so-called (discrete) box queries and related queries (e.g., hybrid queries) on large-scale k-mer data sets. In particular, a new index tree, named the BoND-tree, is developed specifically for the non-ordered discrete data space characterized by k-mer data sets. The unique properties of this space are exploited to develop new node-splitting heuristics for the index tree, and theoretical analysis is performed to show the optimality of the proposed heuristics. Besides the BoND-tree, which is based on data partitioning, index schemes based on space partitioning are also developed for box queries in such a space. To support a more flexible type of query (i.e., hybrid box and range queries), hybrid index schemes that integrate the strengths of box query indexes and range query indexes are studied. To allow efficient index construction for large-scale k-mer data sets, bulk-loading techniques are also developed for the proposed index trees.
In addition, approaches to optimizing box queries for sequence analysis problems such as error correction are examined. Storage structures and the use of box queries to support searches with multiple word lengths on k-mer data sets are also explored. The research is expected to yield fundamental properties of the data space for sequence data in bioinformatics, a number of novel storage, indexing, and retrieval techniques that exploit those properties, and applications of the proposed techniques to important problems in sequence analysis. These results will advance the state of knowledge in storage, indexing, and retrieval techniques for genome sequence databases. They are expected to significantly impact current practice in bioinformatics by making available new, efficient on-disk solutions for sequence analysis. They will also impact a number of other popular application areas, including biometrics, image processing, social networks, and e-commerce, where processing non-ordered discrete multidimensional data is crucial.
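
As a rough illustration of the k-mer and box-query concepts described in the abstract, the following Python sketch extracts overlapping k-mers from a few made-up reads and answers a discrete box query by linear scan. It assumes the usual formulation in which a box gives a set of allowed letters for each position. The function names `kmers` and `box_query`, the sample reads, and the brute-force scan are illustrative assumptions, not the project's BoND-tree or any of its index structures.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def box_query(kmer_counts, box):
    """Return the stored k-mers that fall inside a discrete box.

    `box` is a list with one set of allowed letters per position.
    A k-mer matches when each of its letters is in the set for its position.
    This is a linear scan (hypothetical baseline), not an index lookup.
    """
    return sorted(
        kmer
        for kmer in kmer_counts
        if len(kmer) == len(box)
        and all(letter in allowed for letter, allowed in zip(kmer, box))
    )


# Made-up reads for illustration only.
reads = ["ACGTAC", "ACGTTC", "ACCTAC"]
counts = Counter(km for read in reads for km in kmers(read, 4))

# Box "AC?T": the third position may hold any base, the others are fixed.
# Queries of this shape can enumerate single-substitution neighbours of a k-mer.
box = [{"A"}, {"C"}, set("ACGT"), {"T"}]
matches = box_query(counts, box)
print(matches)
```

A range query would instead bound the total number of mismatches from a query k-mer; the box form fixes which letters are acceptable at each position, which is why the abstract treats the two query types separately.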

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other Grants by Sakti Pramanik

Collaborative Research: Supporting Efficient Similarity Searches for Multidimensional Non-ordered Discrete Data Spaces
  • Award number:
    0414576
  • Fiscal year:
    2005
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
SGER: Performance Studies for Indexing Genome Sequence Databases
  • Award number:
    0228983
  • Fiscal year:
    2002
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
SGER: Data-Distribution Properties in High Dimensional Euclidean Space and their Applications in Optimizing Multi-Media Database Accesses
  • Award number:
    9910605
  • Fiscal year:
    1999
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
HICLAS: An Effective Tool for Interoperability Among Taxonomic Database Systems
  • Award number:
    9630846
  • Fiscal year:
    1996
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Moving the Ribosome Database Project (RDP) to a DBMS Foundation
  • Award number:
    9507552
  • Fiscal year:
    1995
  • Funding amount:
    $273,400
  • Award type:
    Continuing Grant
Hierarchic Database Management Systems and Networking for Systematic Biology
  • Award number:
    9408384
  • Fiscal year:
    1994
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Arabidopsis Biological Resource Center
  • Award number:
    9121030
  • Fiscal year:
    1991
  • Funding amount:
    $273,400
  • Award type:
    Continuing Grant
Hierarchic Database Structures for Implementing Taxonomic Database Systems
  • Award number:
    9021656
  • Fiscal year:
    1991
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Parallel Processing of Multi-Directory Hashing
  • Award number:
    8706069
  • Fiscal year:
    1988
  • Funding amount:
    $273,400
  • Award type:
    Continuing Grant

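Similar NSFC Grants

Mechanism by which the small-molecule metabolite catechin interacts with TRPV1 to activate peripheral sensory neurons and mediate uremic pruritus
  • Award number:
    82371229
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Award type:
    General Program
Mechanism by which DHEA alleviates POCD by inhibiting Fis1 lactylation in microglia
  • Award number:
    82301369
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
SETDB1 regulation of microglial function and its role in the pathogenesis of Alzheimer's disease
  • Award number:
    82371419
  • Approval year:
    2023
  • Funding amount:
    CNY 490,000
  • Award type:
    General Program
Mechanism by which a PTBP1-driven H4K12la/BRD4/HIF1α complex-PKM2 positive feedback loop promotes glucose metabolic reprogramming in non-small cell lung cancer, and exploration of therapeutic strategies
  • Award number:
    82303616
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund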

Similar Overseas Grants

Collaborative Research: III: Small: High-Performance Scheduling for Modern Database Systems
  • Award number:
    2322973
  • Fiscal year:
    2024
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Collaborative Research: III: Small: High-Performance Scheduling for Modern Database Systems
  • Award number:
    2322974
  • Fiscal year:
    2024
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Collaborative Research: III: Small: A DREAM Proactive Conversational System
  • Award number:
    2336769
  • Fiscal year:
    2024
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Collaborative Research: III: Small: A DREAM Proactive Conversational System
  • Award number:
    2336768
  • Fiscal year:
    2024
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant
Collaborative Research: III: Small: Efficient and Robust Multi-model Data Analytics for Edge Computing
  • Award number:
    2311596
  • Fiscal year:
    2023
  • Funding amount:
    $273,400
  • Award type:
    Standard Grant