Genomic Compression: From Information Theory to Parallel Algorithms
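Basic information
- Grant number: 9259954
- Principal investigator:
- Amount: $303,500
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2015
- Funding country: United States
- Period: 2015-06-01 to 2019-05-31
- Project status: Completed
- Source: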
- Keywords: Address; Algorithms; Archives; Area; Arithmetic; Big Data; Biological; Biomedical Research; Categories; Chromosomes; Code; Computer software; DNA sequencing; Data; Data Compression; Databases; Detection; Dimensions; Disease; Ensure; Evaluation; Future; Genome; Genomics; Goals; Government; Growth; Health Care Research; Imagery; Individual; Information Theory; Knowledge; Measurement; Medical Research; Methods; Mining; Modeling; Modernization; Nucleotides; Outcome; Outcomes Research; Performance; Positioning Attribute; Process; Property; Psychological Techniques; Research; Scheme; Side; Sorting - Cell Movement; Speed; Statistical Data Interpretation; Techniques; The Cancer Genome Atlas; Time; Trees; United States National Institutes of Health; Weight; base; cancer genome; clinical practice; computing resources; cost; crowdsourcing; data access; data format; design; disease-causing mutation; experience; functional genomics; genomic data; improved; indexing; novel; novel strategies; operation; parallel computer; personalized medicine; programs; public health relevance; signal processing; statistics; theories; whole genome
Project abstract
DESCRIPTION (provided by applicant): One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research are largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to grow drastically in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that will allow for fast exchange, dissemination, random access, visualization, and search of diversely formatted genomic information. The use of specialized compression methods for biological data will ensure unprecedented growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large-scale dissemination of experimental results.
Specific aims of the proposal include developing parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some of the universal features of the developed compression techniques will make it possible to apply them to other emerging genomic data formats.
The long-term objectives of the proposed research program are twofold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality reduction methods for genomic and functional genomic data, using information-theoretic techniques. The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ, and Wig track data compression. The developed algorithms are expected to include suitably combined, modified, and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context mixing and context-tree weighting with biological side information.
Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms in order to reduce the latency of the compression and decompression process. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and stringing methods. SAM, FASTQ, and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compression of these and other genomic information formats will enable management, transfer, and access to massive data crucial for the operation of governmental and NIH-sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.
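As a rough illustration of aim b), lossy compression of quality scores, the sketch below coarsens the Phred scores of a FASTQ quality string into fixed-width bins and compares the zeroth-order empirical entropy before and after. It is not the project's method: the uniform binning rule, the `bin_width` parameter, the Phred+33 offset, and the sample quality string are all assumptions chosen for illustration. The entropy figure only approximates the bits per symbol that an ideal memoryless entropy coder (for example arithmetic or Huffman coding) would need.

```python
import math
from collections import Counter

PHRED_OFFSET = 33  # assumed Phred+33 (Sanger-style) FASTQ encoding


def decode_phred(quality_string):
    """Convert a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def quantize(scores, bin_width):
    """Lossy step: replace each score by the midpoint of its bin."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Made-up quality string, used only to exercise the functions.
    quality = "IIIIHHGGFFEEDDCCBBA@?>=<;:9876"
    scores = decode_phred(quality)
    coarse = quantize(scores, bin_width=8)
    print(f"bits/score before binning: {empirical_entropy(scores):.3f}")
    print(f"bits/score after binning:  {empirical_entropy(coarse):.3f}")
```

Because binning is a deterministic mapping, the entropy after binning can never exceed the entropy before it. The price is distortion in the scores, which is the trade-off a lossy quality-score compressor has to manage.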
Project outcomes
Journal articles (27)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores.
- DOI: 10.1109/itw.2016.7606808
- Published: 2016-09
- Journal:
- Impact factor: 0
- Authors: Ochoa I; No A; Hernaez M; Weissman T
- Corresponding author: Weissman T
Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations.
- DOI: 10.1109/tnet.2017.2728638
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: Dau, Hoang; Milenkovic, Olgica
- Corresponding author: Milenkovic, Olgica
Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes.
- DOI: 10.1109/tit.2016.2535172
- Published: 2016
- Journal:
- Impact factor: 2.5
- Authors: Steiner, Fabian; Dempfle, Steffen; Ingber, Amir; Weissman, Tsachy
- Corresponding author: Weissman, Tsachy
Chained Kullback-Leibler Divergences.
- DOI: 10.1109/isit.2016.7541365
- Published: 2016-07
- Journal:
- Impact factor: 0
- Authors: Pavlichin DS; Weissman T
- Corresponding author: Weissman T
Aligned genomic data compression via improved modeling.
- DOI: 10.1142/s0219720014420025
- Published: 2014
- Journal:
- Impact factor: 1
- Authors: Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy
- Corresponding author: Weissman, Tsachy
Other grants by Olgica Milenkovic
Genomic Compression: From Information Theory to Parallel Algorithms
- Grant number: 9239305
- Fiscal year: 2015
- Funding amount: $303,500
- Project category:
Genomic Compression: From Information Theory to Parallel Algorithms
- Grant number: 8876278
- Fiscal year: 2015
- Funding amount: $303,500
- Project category:
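Similar National Natural Science Foundation of China (NSFC) grants
Convex relaxation and high- and low-order accelerated algorithms for distributed nonconvex nonsmooth optimization problems
- Grant number: 12371308
- Approval year: 2023
- Funding amount: CNY 435,000
- Project category: General Program
Design and hardware implementation of ensemble learning algorithms under resource constraints
- Grant number: 62372198
- Approval year: 2023
- Funding amount: CNY 500,000
- Project category: General Program
Fast algorithms for electromagnetic fields based on physics-informed neural networks
- Grant number: 52377005
- Approval year: 2023
- Funding amount: CNY 520,000
- Project category: General Program
SPH model and efficient algorithms for deformation and flow of saturated sand considering pile-soil-water coupling effects
- Grant number: 12302257
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Ensemble classification algorithms for high-dimensional imbalanced data
- Grant number: 62306119
- Approval year: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund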
Similar overseas grants
Brain Digital Slide Archive: An Open Source Platform for data sharing and analysis of digital neuropathology
- Grant number: 10735564
- Fiscal year: 2023
- Funding amount: $303,500
- Project category:
Computer-Aided Triage of Body CT Scans with Deep Learning
- Grant number: 10585553
- Fiscal year: 2023
- Funding amount: $303,500
- Project category:
Point-of-care diagnostic test for T. cruzi (Chagas) infection
- Grant number: 10603665
- Fiscal year: 2023
- Funding amount: $303,500
- Project category:
A visualization interface for BRAIN single cell data, integrating transcriptomics, epigenomics and spatial assays
- Grant number: 10643313
- Fiscal year: 2023
- Funding amount: $303,500
- Project category: