CAREER: Model-based compression and probabilistic analysis of non-Markovian sequences
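Basic information
- Award number: 2144974
- Principal investigator:
- Amount: $559,500
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2022
- Funding country: United States
- Period: 2022-10-01 to 2027-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project abstract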
This project aims to develop efficient data-compression and analysis methods for large and complex data based on probabilistic models that will facilitate algorithm design, analysis, and evaluation. The project advances flexible probabilistic models capable of accurately representing such data. These models will be leveraged to design scalable analysis and compression algorithms, establish their fundamental limits, and provide provable performance guarantees. In particular, the project will study data-compression algorithms for removing redundancy in large-scale data-storage systems, where traditional compression methods are computationally infeasible. It will also develop novel estimation and testing algorithms for genomic sequences, where existing probabilistic models are too restrictive to faithfully represent their internal statistical structure. The project considers fundamental problems in information theory and statistical signal processing and has the potential to contribute to public health through more accurate statistical analysis of genomic data. The research results will be incorporated in a range of educational activities, including developing interactive and accessible online courses that will emphasize connections between mathematics, engineering, and science, and promote a principled model-based approach to solving engineering and scientific problems.
The project has two research thrusts, which correspond to two critical settings in which conventional probabilistic models of sequences, most commonly Markov as well as independent and identically distributed (iid) models, and their associated methods, are inapplicable. The first thrust focuses on sequences with long-range redundancy, i.e., with long repeated blocks appearing at large distances, common in terabyte-scale data storage systems. The project will develop generative data-driven models for sources with approximate repeats, establish information-theoretic bounds on compressing them, and develop and optimize compression algorithms, including compression of distributed sources and universal compression for sources with unknown parameters. The second thrust focuses on evolutionary sources, i.e., those that produce data through consecutive edits, used to model the generation process of genomic data. Problems such as parameter estimation, hypothesis testing, and the prediction of future behavior for evolutionary sources will be addressed by formulating a stochastic approximation framework in which asymptotic and finite-time behavior of sequences are analyzed. The resulting analysis methods and algorithms developed in this thrust will be used to study several problems in bioinformatics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
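As a rough illustration of the long-range redundancy described in the first thrust, the sketch below compares a standard windowed compressor with a simple chunk-level deduplicator on data whose only redundancy is one long block repeated at a distance of 64 KiB. It is not the project's method: the fixed 4 KiB chunk size, the `dedup_size` helper, and the synthetic input are all assumptions made for illustration, and the sketch handles only exact, chunk-aligned repeats, whereas the abstract targets approximate repeats.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # hypothetical fixed chunk size, chosen only for this sketch


def dedup_size(data: bytes, chunk: int = CHUNK) -> int:
    """Return the bytes stored when each distinct fixed-size chunk is kept once.

    The cost of the per-chunk references is ignored to keep the sketch short.
    """
    seen = set()
    stored = 0
    for start in range(0, len(data), chunk):
        piece = data[start:start + chunk]
        digest = hashlib.sha256(piece).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(piece)
    return stored


# Synthetic input: 64 KiB of pseudo-random bytes followed by an exact copy.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
data = block + block

# DEFLATE (zlib) searches for matches only within a 32 KiB window,
# so it cannot exploit a repeat that starts 64 KiB earlier.
zlib_size = len(zlib.compress(data, 9))
dedup = dedup_size(data)

print(f"input: {len(data)} bytes, zlib: {zlib_size} bytes, dedup: {dedup} bytes")
```

In this setup the windowed compressor leaves the input essentially uncompressed, while the deduplicator stores the repeated block once. Fixed, aligned chunks are the simplest choice here; they would miss repeats that are shifted or slightly edited, which is the harder case the abstract refers to.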
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Farzad Farnoud
HyDRA: gene prioritization via hybrid distance-score rank aggregation
- DOI: 10.1093/bioinformatics/btu766
- Published: 2015-04-01
- Journal:
- Impact factor: 5.8
- Authors: Minji Kim; Farzad Farnoud; O. Milenkovic
- Corresponding author: O. Milenkovic
Constrained Code for Data Storage in DNA via Nanopore Sequencing
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Kallie Whritenour; M. Civelek; Farzad Farnoud
- Corresponding author: Farzad Farnoud
On the Multimessage Capacity Region for Undirected Ring Networks
- DOI: 10.1109/tit.2010.2040866
- Published: 2010-04-01
- Journal:
- Impact factor: 2.5
- Authors: S. M. T. Yazdi; S. Savari; G. Kramer; Kelli Carlson; Farzad Farnoud
- Corresponding author: Farzad Farnoud
Noise and uncertainty in string-duplication systems
- DOI:
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: Siddhartha Jain; Farzad Farnoud; Moshe Schwartz; Jehoshua Bruck
- Corresponding author: Jehoshua Bruck
Duplication-correcting codes for data storage in the DNA of living organisms
- DOI: 10.1109/isit.2016.7541455
- Published: 2016-06-01
- Journal:
- Impact factor: 0
- Authors: Siddhartha Jain; Farzad Farnoud; Moshe Schwartz; Jehoshua Bruck
- Corresponding author: Jehoshua Bruck
Other grants held by Farzad Farnoud
Collaborative Research: CIF: Small: Versatile Data Synchronization: Novel Codes and Algorithms for Practical Applications
- Award number: 2312871
- Fiscal year: 2023
- Amount: $559,500
- Project type: Standard Grant
CIF: Small: Collaborative Research: Rank Aggregation with Heterogeneous Information Sources: Efficient Algorithms and Fundamental Limits
- Award number: 1908544
- Fiscal year: 2019
- Amount: $559,500
- Project type: Standard Grant
CRII: CIF: Model-based Compression of Biological Sequences
- Award number: 1755773
- Fiscal year: 2018
- Amount: $559,500
- Project type: Standard Grant
CIF: NSF-BSF: Small: Collaborative Research: Characterization and Mitigation of Noise in a Live DNA Storage Channel
- Award number: 1816409
- Fiscal year: 2018
- Amount: $559,500
- Project type: Standard Grant
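Similar grants from the National Natural Science Foundation of China
Nephrotoxicity testing and occupational health risk assessment of mixed metal exposure in the lithium battery industry based on human-derived organoid models
- Award number: 82373546
- Year approved: 2023
- Amount: CNY 490,000
- Project type: General Program
The impact of professional players on pricing in short-term home rental platforms: an analysis based on game-theoretic models
- Award number: 71902018
- Year approved: 2019
- Amount: CNY 200,000
- Project type: Young Scientists Fund
The effect of intergenerational occupational inheritance on labor productivity: counterfactual experiments based on a multi-sector model of intergenerational occupational choice with heterogeneous individuals
- Award number: 71703180
- Year approved: 2017
- Amount: CNY 170,000
- Project type: Young Scientists Fund
A model of the mechanisms affecting the career stability of nursing staff, based on empowerment theory
- Award number: 71704132
- Year approved: 2017
- Amount: CNY 180,000
- Project type: Young Scientists Fund
Constructing a model of physician professionalism in public hospitals based on social embeddedness theory, and strategies for institutional optimization
- Award number: 71573094
- Year approved: 2015
- Amount: CNY 480,000
- Project type: General Program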
Similar overseas grants
CAREER: Efficient Large Language Model Inference Through Codesign: Adaptable Software Partitioning and FPGA-based Distributed Hardware
- Award number: 2339084
- Fiscal year: 2024
- Amount: $559,500
- Project type: Continuing Grant
Neurodevelopment of executive function, appetite regulation, and obesity in children and adolescents
- Award number: 10643633
- Fiscal year: 2023
- Amount: $559,500
- Project type:
Examining co-production as an implementation strategy for autism early intervention delivered in Part C service systems
- Award number: 10663459
- Fiscal year: 2023
- Amount: $559,500
- Project type:
Defining the neural basis for persistent obesity
- Award number: 10735128
- Fiscal year: 2023
- Amount: $559,500
- Project type:
Hawaii Minority Health and Cancer Disparities SPORE
- Award number: 10716152
- Fiscal year: 2023
- Amount: $559,500
- Project type: