K-mer indexing for pan-genome reference annotation
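Basic information
- Grant number: 10793082
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2023
- Funding country: United States
- Project period: 2023-02-22 to 2024-01-31
- Project status: Completed
- Source: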
- Keywords: Acceleration; Address; Algorithms; Architecture; BRCA mutations; Biological; Biomedical Research; Bite; Chromosomes; ClinVar; Clinical; Clinical assessments; Cloud Computing; Code; Collection; Communities; Complex; Data; Data Set; Databases; Development; Diploidy; Disease; Elements; Foundations; Frequencies; Gene Frequency; Genes; Genetic Annotation; Genetic Code; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Goals; Haplotypes; Human; Human Biology; Human Genetics; Human Genome; Individual; Infrastructure; Intuition; Length; Link; Location; Maps; Memory; Metadata; Methods; Nature; Nucleotides; Oncogenes; Performance; Persons; Phase; Population; Privacy; Process; Research; Research Personnel; Resolution; Sampling; Savings; Scheme; Sequence Analysis; Speed; System; Update; Variant; Work; clinical application; clinically relevant; cloud based; community engagement; cost; data sharing; design; flexibility; foot; genetic variant; genome sciences; genome sequencing; genomic data; human disease; human reference genome; improved; indexing; next generation; next generation sequencing; novel; pan-genome; population based; preservation; reference genome; web portal
Project abstract
ABSTRACT
The human genome reference sequence is one of the foundations of genome sciences, especially in the context
of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research
and been particularly instrumental in human disease gene identification. However, the human genome reference
is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual
flexibility to represent the breadth of human variation. Important elements of individual genomes are either
missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with
population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is
more efficient computationally, provides accurate representation in the context of populations and facilitates the
analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational
architecture that will encode and annotate large collections of genomes in the context of a pan-genome
reference.
First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased
reference genomes, by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner
that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to
include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for
directly analyzing compressed genomic data.
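The indexing and incremental-update steps described in this aim can be illustrated with a minimal sketch. It is not the project's implementation: the `KmerRecord` class, the function names, the toy sequences and the `sample_1` label are all hypothetical, and a plain in-memory dictionary stands in for whatever storage scheme the authors use.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KmerRecord:
    # 0-based reference start positions; stays empty for K-mers absent from the reference.
    ref_positions: list = field(default_factory=list)
    # Illustrative metadata: labels of the samples that carry this K-mer.
    samples: set = field(default_factory=set)


def iter_kmers(seq, k):
    """Yield (offset, kmer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(reference, k):
    """Index every K-mer of the reference together with its start positions."""
    index = defaultdict(KmerRecord)
    for pos, kmer in iter_kmers(reference, k):
        index[kmer].ref_positions.append(pos)
    return index


def add_haplotype(index, haplotype, k, sample):
    """Add one haplotype to an existing index and return the K-mers that were new."""
    novel = []
    for _, kmer in iter_kmers(haplotype, k):
        if kmer not in index:  # membership test does not create an entry
            novel.append(kmer)
        index[kmer].samples.add(sample)
    return novel


# Toy reference (hypothetical, not GRCh38).
reference = "ACGTACGGTCA"
index = build_index(reference, k=4)

# Hypothetical haplotype with one substitution (G -> T at 0-based position 6).
novel = add_haplotype(index, "ACGTACTGTCA", k=4, sample="sample_1")
print(novel)
```

In this toy case only the K-mers that overlap the substituted base are new, so the update touches a handful of entries rather than rebuilding the index.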
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known
human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2)
developing functions for our pan-genomic index that support ultra-rapid queries, such as queries for clinically
important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome
index to allow genetic variation to be annotated against a particular genome reference.
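As a rough illustration of linking coordinates to K-mer metadata, the sketch below builds the K-mer that carries a substitution's alternate allele and looks it up in a dictionary keyed by K-mer. Everything here is an assumption made for the example: the function names, the toy sequence, the `toy_ref` contig label and the metadata fields are hypothetical, and the project's actual query functions are not described in the abstract.

```python
def variant_kmer(reference, pos, alt, k):
    """Return the K-mer centered on a substitution at 0-based `pos` carrying allele `alt`.

    An odd k keeps the variant base exactly in the middle of the window.
    """
    flank = k // 2
    start = pos - flank
    if start < 0 or start + k > len(reference):
        raise ValueError("variant too close to the sequence end for this k")
    window = reference[start:start + k]
    return window[:flank] + alt + window[flank + 1:]


def annotate(index, kmer):
    """Return the coordinate metadata stored for a K-mer, or None if it is absent."""
    return index.get(kmer)


# Toy index: K-mer -> coordinate metadata (all values are made up).
reference = "TTGACCGATAGC"
k = 5
index = {}
for i in range(len(reference) - k + 1):
    index[reference[i:i + k]] = {"contig": "toy_ref", "start": i, "variant": None}

# Register one known substitution (C -> T at position 5) under its alternate K-mer.
alt_kmer = variant_kmer(reference, pos=5, alt="T", k=k)
index[alt_kmer] = {"contig": "toy_ref", "start": 3, "variant": "5:C>T"}

print(alt_kmer, annotate(index, alt_kmer))
```

Because the lookup is a single dictedionary access keyed by the K-mer itself, the query does not scan the reference; the coordinates come back as metadata attached to the matching entry.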
Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility
of our approach, to promote community engagement and to enable contributions from the research community.
We expect that completion of these aims will provide: a scalable computational architecture which incorporates
the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will
remain nearly constant as the database grows; and a universally accessible portal using cloud computing.
This work will help solve the issues of multiple assemblies. It will improve researchers’ ability to understand
the relationship between variants and disease, while also providing great savings over the long term in infrastructure
and computing costs.
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations.
- DOI:10.1093/narcan/zcaa034
- Publication date: 2020-12
- Journal:
- Impact factor: 5.1
- Authors: Lee H; Shuaibi A; Bell JM; Pavlichin DS; Ji HP
- Corresponding author: Ji HP
Other grants by Hanlee P Ji
Integrating cancer genomics and spatial architecture of tumor infiltrating lymphocytes
- Grant number: 10637960
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Project 1 - Molecular and Cellular Determinants of High Risk Gastric Precancerous Lesions
- Grant number: 10715762
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Multimodal iterative sequencing of cancer genomes and single tumor cells
- Grant number: 10363694
- Fiscal year: 2021
- Funding amount: $300,000
- Project category:
Multimodal iterative sequencing of cancer genomes and single tumor cells
- Grant number: 10112576
- Fiscal year: 2021
- Funding amount: $300,000
- Project category:
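Similar National Natural Science Foundation of China (NSFC) grants
Research on spatiotemporal-sequence-driven neuromorphic vision object recognition algorithms
- Grant number: 61906126
- Approval year: 2019
- Funding amount: CNY 240,000
- Project category: Young Scientists Fund project
Ontology-driven spatial semantic modeling of address data and address matching methods
- Grant number: 41901325
- Approval year: 2019
- Funding amount: CNY 220,000
- Project category: Young Scientists Fund project
Research on optimized address mapping table design and memory access optimization for large-capacity solid-state drives
- Grant number: 61802133
- Approval year: 2018
- Funding amount: CNY 230,000
- Project category: Young Scientists Fund project
Research on IP-address-driven multipath routing and traffic transmission control
- Grant number: 61872252
- Approval year: 2018
- Funding amount: CNY 640,000
- Project category: General Program project
Research on memory safety defense techniques for objects targeted by memory attacks
- Grant number: 61802432
- Approval year: 2018
- Funding amount: CNY 250,000
- Project category: Young Scientists Fund project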
Similar overseas grants
Implementation of an impact assessment tool to optimize responsible stewardship of genomic data in the cloud
- Grant number: 10721762
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
A computational model for prediction of morphology, patterning, and strength in bone regeneration
- Grant number: 10727940
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Unified, Scalable, and Reproducible Neurostatistical Software
- Grant number: 10725500
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
High-resolution cerebral microvascular imaging for characterizing vascular dysfunction in Alzheimer's disease mouse model
- Grant number: 10848559
- Fiscal year: 2023
- Funding amount: $300,000
- Project category:
Bioethical, Legal, and Anthropological Study of Technologies (BLAST)
- Grant number: 10831226
- Fiscal year: 2023
- Funding amount: $300,000
- Project category: