Advances in large-scale genomics studies have enabled the compilation of cell type–specific, genome-wide DNA functional elements at high resolution. As the volume of functional annotation data and sequencing variants grows, existing variant annotation algorithms lack the efficiency and scalability needed to process large genomic datasets, particularly when annotating whole-genome sequencing variants against a database containing billions of genomic features. Here, we develop VarNote to rapidly annotate genome-scale variants against large and complex functional annotation resources. Equipped with a novel index system and a parallel random-sweep searching algorithm, VarNote achieves performance improvements of two to three orders of magnitude over existing algorithms at different scales. It supports both region-based and allele-specific annotations and introduces advanced functions for the flexible extraction of annotations. By integrating massive base-wise and context-dependent annotations into the VarNote framework, we introduce three efficient and accurate pipelines to prioritize causal regulatory variants for common diseases, Mendelian disorders, and cancers.