K-mer indexing for pan-genome reference annotation

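Basic information

Grant number:
10793082
Principal investigator:
Hanlee P Ji
Amount:
$300,000
Host institution:
STANFORD UNIVERSITY
Institution country:
United States
Project category:
Fiscal year:
2023
Funding country:
United States
Project period:
2023-02-22 to 2024-01-31
Project status:
Completed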

Source:
https://reporter.nih.gov/project-details/10793082
Keywords:
Acceleration; Address; Algorithms; Architecture; BRCA mutations; Biological; Biomedical Research; Bite; Chromosomes; ClinVar; Clinical; Clinical assessments; Cloud Computing; Code; Collection; Communities; Complex; Data; Data Set; Databases; Development; Diploidy; Disease; Elements; Foundations; Frequencies; Gene Frequency; Genes; Genetic Annotation; Genetic Code; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Goals; Haplotypes; Human; Human Biology; Human Genetics; Human Genome; Individual; Infrastructure; Intuition; Length; Link; Location; Maps; Memory; Metadata; Methods; Nature; Nucleotides; Oncogenes; Performance; Persons; Phase; Population; Privacy; Process; Research; Research Personnel; Resolution; Sampling; Savings; Scheme; Sequence Analysis; Speed; System; Update; Variant; Work; clinical application; clinically relevant; cloud based; community engagement; cost; data sharing; design; flexibility; foot; genetic variant; genome sciences; genome sequencing; genomic data; human disease; human reference genome; improved; indexing; next generation; next generation sequencing; novel; pan-genome; population based; preservation; reference genome; web portal

Project abstract

The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and has been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is computationally more efficient, provides accurate representation in the context of populations, and facilitates the analysis of diverse human genomes. Our goal is to use this strategy to develop a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference.

First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes, by (1) generating an index of all K-mers in the human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, then (2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, while (3) developing schemes for directly analyzing compressed genomic data.

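The abstract does not describe the index's internal layout, so the following is only a minimal sketch of the first two steps, assuming a plain in-memory Python dictionary as a stand-in for the real data structure. The class `KmerIndex`, its methods, the toy sequences, and the choice of k = 4 are all hypothetical and chosen for illustration; a production index over GRCh38 would need a far more compact encoding.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class KmerIndex:
    """Toy k-mer index (hypothetical): k-mer -> set of (source, offset) records."""

    def __init__(self, k):
        self.k = k
        self.table = defaultdict(set)

    def add_sequence(self, source, seq):
        """Index seq under the label `source`; return k-mers not seen before."""
        novel = []
        for offset, kmer in kmers(seq, self.k):
            if kmer not in self.table:
                novel.append(kmer)
            # The (source, offset) record plays the role of per-k-mer metadata.
            self.table[kmer].add((source, offset))
        return novel

    def lookup(self, kmer):
        """Return the sorted metadata records stored for one k-mer."""
        return sorted(self.table.get(kmer, ()))


# Step 1: index a toy "reference". Step 2: add a haplotype incrementally.
index = KmerIndex(k=4)
index.add_sequence("reference", "ACGTACGA")
# The sample differs from the reference by one substitution at offset 4 (A to T).
novel = index.add_sequence("sample1", "ACGTTCGA")
print(novel)                 # k-mers introduced by the substitution
print(index.lookup("ACGT"))  # shared k-mer, recorded for both sources
```

In this toy case a single substitution contributes four novel 4-mers, while the k-mer shared by both sequences is stored once with two metadata records. That is the intuition behind updating an existing index incrementally instead of rebuilding it for each new genome.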
Second, we plan to apply K-mer representation to genomic analysis by (1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, (2) developing functions for our pan-genomic index that support ultra-rapid queries, such as queries for clinically important variants, and (3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to allow genetic variation to be annotated against a particular genome reference.

Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement, and to enable contributions from the research community.

We expect that completion of these aims will provide: a scalable computational architecture that incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing. This work will help solve the issues of multiple assemblies. It will improve researchers' ability to understand the relationship between variants and disease, while also providing great savings over the long term in infrastructure and computing costs.
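To make the second aim concrete, here is a hedged sketch of how coordinate information might be attached to K-mers and then queried. It is not the project's implementation: the functions `variant_kmers`, `build_variant_index`, and `annotate`, the toy contig name `chrT`, the variant name `toy_snv`, and the use of 0-based positions are all assumptions made for illustration. A Python dictionary is used because its lookups take expected constant time regardless of how many entries it holds, which loosely mirrors the goal of query speeds that stay nearly constant as the database grows.

```python
def variant_kmers(ref_seq, pos, ref, alt, k):
    """Return the k-mers that span the ALT allele of one variant.

    pos is the 0-based offset of the REF allele in ref_seq (an assumption
    of this sketch; coordinate conventions differ between file formats).
    """
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + len(ref):]
    start = max(0, pos - k + 1)
    end = min(len(alt_seq), pos + len(alt) + k - 1)
    window = alt_seq[start:end]
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def build_variant_index(ref_seq, variants, k):
    """Map each variant-specific k-mer to the variant records it supports."""
    ref_kmers = {ref_seq[i:i + k] for i in range(len(ref_seq) - k + 1)}
    index = {}
    for record in variants:
        name, chrom, pos, ref, alt = record
        for kmer in variant_kmers(ref_seq, pos, ref, alt, k):
            # Skip k-mers that also occur in the reference: they are not
            # evidence for the variant.
            if kmer not in ref_kmers:
                index.setdefault(kmer, []).append(record)
    return index


def annotate(read, index, k):
    """Return the variant records hit by any k-mer of the read."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits.update(index.get(read[i:i + k], ()))
    return sorted(hits)


# Toy data: one substitution (A to T) at 0-based offset 4 of a made-up contig.
reference = "ACGTACGA"
variants = [("toy_snv", "chrT", 4, "A", "T")]
idx = build_variant_index(reference, variants, k=4)

print(annotate("GTTCG", idx, 4))   # read carrying the ALT allele
print(annotate("ACGTAC", idx, 4))  # read matching the reference
```

Each hit carries its own coordinate record, so a match found purely by K-mer lookup can still be reported in the coordinate system of a particular reference. Real variant catalogs such as ClinVar would require handling of repeats, reverse complements, and multi-allelic sites, none of which this sketch attempts.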