AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing

AF:小:重复感知文本压缩和索引的理论方面

基本信息

项目摘要

Being able to store, search, and analyze massive data sets efficiently is one of today's pressing challenges. To that end, this project will study a collection of problems under text compression and indexing with tremendous current relevance, owing to a specific characteristic prevalent in many modern text data sets, called high repetitiveness. This characteristic makes the data highly compressible using some specialized schemes. However, the theoretical understanding of those schemes is still in a nascent stage. This project will address some of the fundamental open problems on the effectiveness of several schemes that are popular in practice. This project will also introduce new ideas for indexing such data in a highly space-efficient manner and quickly support various (application-specific) queries. The techniques developed in this project will apply to a broad class of algorithmic problems and applications; therefore, they will have a lasting impact on related fields. The main results stem from this research will be disseminated through major conferences, journals, and workshop tutorials. The participation of undergraduates and minorities, including women, will be ensured. Suffix trees and suffix arrays are two fundamental data structures for text indexing with many applications in bioinformatics. However, they are notorious for their space complexity. The FM-index index encodes suffix arrays in space close to the entropy-based lower bound using the Burrows-Wheeler Transformation (BWT). But, the entropy model of compression is less effective in capturing repetitiveness. Therefore, modern applications urge even more space frugal encodings that exploit repetitiveness. Although the community has made some recent progress in exact pattern matching, solutions to many other (more complex) matching problems are still open. To that end, this project will attempt to design repetition-aware indexes for pattern matching under mismatches, edits, and wildcards, along with efficient construction algorithms. This project also aims to develop a unified framework to compress several advanced suffix-tree variants (quasi-suffix trees) that support even more complex matching paradigms such as parameterized and order-isomorphic matching. Novel techniques and concepts that go beyond the existing BWT-based methods will be sought. Additionally, the project will attempt a deeper study on various aspects of BWT-runs and the relative Lempel-Ziv compression scheme from the hardness side.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
能够有效地存储、搜索和分析海量数据集是当今紧迫的挑战之一。为此,该项目将研究文本压缩和索引下的一系列问题,这些问题与当前具有巨大的相关性,因为许多现代文本数据集中普遍存在一种称为高重复性的特定特征。这一特性使得数据可以使用一些专门的方案进行高度压缩。然而,对这些方案的理论理解仍处于初级阶段。该项目将解决一些关于实践中流行的几种方案有效性的基本开放问题。该项目还将引入以高度节省空间的方式索引此类数据的新想法,并快速支持各种(特定于应用程序的)查询。该项目开发的技术将适用于广泛的算法问题和应用;因此,它们将对相关领域产生持久的影响。这项研究的主要结果将通过主要会议、期刊和研讨会教程进行传播。将确保本​​科生和包括女性在内的少数民族的参与。后缀树和后缀数组是文本索引的两种基本数据结构,在生物信息学中有许多应用。然而,它们因其空间复杂性而臭名昭著。 FM-index 索引使用 Burrows-Wheeler 变换 (BWT) 对接近基于熵的下界的空间中的后缀数组进行编码。但是,压缩的熵模型在捕获重复性方面效果较差。因此,现代应用程序迫切需要利用重复性的更节省空间的编码。尽管社区最近在精确模式匹配方面取得了一些进展,但许多其他(更复杂)匹配问题的解决方案仍然开放。为此,该项目将尝试设计重复感知索引,用于不匹配、编辑和通配符下的模式匹配,以及高效的构造算法。该项目还旨在开发一个统一的框架来压缩几种高级后缀树变体(准后缀树),这些变体支持更复杂的匹配范例,例如参数化和顺序同构匹配。我们将寻求超越现有基于 BWT 的方法的新技术和概念。此外,该项目将尝试从硬度方面对 BWT 运行和相对 Lempel-Ziv 压缩方案的各个方面进行更深入的研究。该奖项反映了 NSF 的法定使命,并通过使用基金会的智力优点和技术进行评估,被认为值得支持。更广泛的影响审查标准。

项目成果

期刊论文数量(5)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
On the Complexity of Recognizing Wheeler Graphs
论识别惠勒图的复杂性
  • DOI:
    10.1007/s00453-021-00917-5
  • 发表时间:
    2022-03
  • 期刊:
  • 影响因子:
    1.1
  • 作者:
    Gibney, Daniel;Thankachan, Sharma V.
  • 通讯作者:
    Thankachan, Sharma V.
The Complexity of Approximate Pattern Matching on de Bruijn Graphs
de Bruijn 图上近似模式匹配的复杂性
  • DOI:
    10.1007/978-3-031-04749-7_16
  • 发表时间:
    2022-01
  • 期刊:
  • 影响因子:
    0
  • 作者:
    Gibney, Daniel;Thankachan, Sharma V.;Aluru, Srinivas
  • 通讯作者:
    Aluru, Srinivas
The Heaviest Induced Ancestors Problem: Better Data Structures and Applications
最严重的诱发祖先问题:更好的数据结构和应用程序
  • DOI:
    10.1007/s00453-022-00955-7
  • 发表时间:
    2022-07
  • 期刊:
  • 影响因子:
    1.1
  • 作者:
    Abedin, Paniz;Hooshmand, Sahar;Ganguly, Arnab;Thankachan, Sharma V.
  • 通讯作者:
    Thankachan, Sharma V.
Co-linear Chaining with Overlaps and Gap Costs
具有重叠和间隙成本的共线链接
  • DOI:
    10.1007/978-3-031-04749-7_15
  • 发表时间:
    2022-01
  • 期刊:
  • 影响因子:
    0
  • 作者:
    Jain, Chirag;Gibney, Daniel;Thankachan, Sharma V.
  • 通讯作者:
    Thankachan, Sharma V.
Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)
高级模式匹配问题的紧凑文本索引:参数化、顺序同构、2D 等(特邀演讲)
  • DOI:
    10.4230/lipics.cpm.2022.3
  • 发表时间:
    2022-01
  • 期刊:
  • 影响因子:
    0
  • 作者:
    Thankachan; Sharma V.
  • 通讯作者:
    Sharma V.
{{ item.title }}
{{ item.translation_title }}
  • DOI:
    {{ item.doi }}
  • 发表时间:
    {{ item.publish_year }}
  • 期刊:
  • 影响因子:
    {{ item.factor }}
  • 作者:
    {{ item.authors }}
  • 通讯作者:
    {{ item.author }}

数据更新时间:{{ journalArticles.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ monograph.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ sciAawards.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ conferencePapers.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ patent.updateTime }}

Sharma Thankachan其他文献

Sharma Thankachan的其他文献

{{ item.title }}
{{ item.translation_title }}
  • DOI:
    {{ item.doi }}
  • 发表时间:
    {{ item.publish_year }}
  • 期刊:
  • 影响因子:
    {{ item.factor }}
  • 作者:
    {{ item.authors }}
  • 通讯作者:
    {{ item.author }}

{{ truncateString('Sharma Thankachan', 18)}}的其他基金

REU Site: Algorithm Design --- Theory and Engineering
REU网站:算法设计---理论与工程
  • 批准号:
    2349179
  • 财政年份:
    2024
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
CAREER: Algorithmic Aspects of Pan-genomic Data Modeling, Indexing and Querying
职业:泛基因组数据建模、索引和查询的算法方面
  • 批准号:
    2316691
  • 财政年份:
    2023
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Continuing Grant
AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing
AF:小:重复感知文本压缩和索引的理论方面
  • 批准号:
    2315822
  • 财政年份:
    2023
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
CAREER: Algorithmic Aspects of Pan-genomic Data Modeling, Indexing and Querying
职业:泛基因组数据建模、索引和查询的算法方面
  • 批准号:
    2146003
  • 财政年份:
    2022
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Continuing Grant
NSF Student Travel Grant for Workshop on String Algorithms in Bioinformatics (StringBio), 2019
NSF 学生生物信息学字符串算法研讨会 (StringBio) 旅行补助金,2019
  • 批准号:
    1946289
  • 财政年份:
    2019
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
NSF Student Travel Grant for 2018 International Workshop on String Algorithms in Bioinformatics (StringBio)
NSF 学生旅费资助 2018 年生物信息学字符串算法国际研讨会 (StringBio)
  • 批准号:
    1849136
  • 财政年份:
    2018
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
AF: Medium: Collaborative Research: Sequential and Parallel Algorithms for Approximate Sequence Matching with Applications to Computational Biology
AF:媒介:协作研究:近似序列匹配的顺序和并行算法及其在计算生物学中的应用
  • 批准号:
    1703489
  • 财政年份:
    2017
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant

相似国自然基金

基于翻译组学理论探究LncRNA H19编码多肽PELRM促进小胶质细胞活化介导电针巨刺改善膝关节术后疼痛的机制研究
  • 批准号:
    82305399
  • 批准年份:
    2023
  • 资助金额:
    30 万元
  • 项目类别:
    青年科学基金项目
基于小增益理论的物联网聚合计算鲁棒稳定性分析
  • 批准号:
    62303112
  • 批准年份:
    2023
  • 资助金额:
    30 万元
  • 项目类别:
    青年科学基金项目
面向高阶谐振网络与复杂调制方式的谐振变换器统一多频率小信号建模理论研究
  • 批准号:
    52307196
  • 批准年份:
    2023
  • 资助金额:
    30 万元
  • 项目类别:
    青年科学基金项目

相似海外基金

AF: Small: Theoretical Aspects of Repetition-Aware Text Compression and Indexing
AF:小:重复感知文本压缩和索引的理论方面
  • 批准号:
    2315822
  • 财政年份:
    2023
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
AF: CQIS: Small: Theoretical Problems in Quantum Information
AF:CQIS:小:量子信息中的理论问题
  • 批准号:
    1717523
  • 财政年份:
    2017
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
AF: Small: THEORETICAL AND ALGORITHMIC FOUNDATIONS OF CONSTRAINED PARTICLE FILTERING
AF:小:约束粒子过滤的理论和算法基础
  • 批准号:
    1527822
  • 财政年份:
    2015
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
AF: Small: Theoretical Frameworks for Modern Parallel Computing Environments
AF:小型:现代并行计算环境的理论框架
  • 批准号:
    1320675
  • 财政年份:
    2013
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
AF: CIF: Small: Theoretical Problems in Quantum Cmputation and Cmmunication
AF:CIF:小:量子计算和通信中的理论问题
  • 批准号:
    1216729
  • 财政年份:
    2012
  • 资助金额:
    $ 44.98万
  • 项目类别:
    Standard Grant
{{ showInfoDetail.title }}

作者:{{ showInfoDetail.author }}

知道了