Identifying functionally similar orthologous regulatory regions between human and model organism genomes is critical for exploiting model organism research and advancing our understanding of results from genome-wide association studies (GWAS). Sequence conservation is the de facto approach for finding orthologous non-coding regions between human and model organism genomes. However, existing methods for mapping non-coding genomic regions across species suffer from multi-mapping, low precision, and low mapping rates.
We develop Adaptive liftOver (AdaLiftOver), a large-scale computational tool for identifying functionally similar orthologous non-coding regions across species. AdaLiftOver builds on the UCSC liftOver framework: it extends the query regions and prioritizes the resulting candidate target regions based on the conservation of epigenomic and sequence grammar features. Evaluations of AdaLiftOver in multiple case studies, spanning both genomic intervals from epigenome datasets across a wide range of model organisms and GWAS SNPs, show that AdaLiftOver is a versatile method for deriving hard-to-obtain human epigenome datasets and for reliably identifying orthologous loci for GWAS SNPs.
The R package and the data for AdaLiftOver are available at https://github.com/keleslab/AdaLiftOver.
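The extend-then-prioritize idea described above can be illustrated with a minimal sketch. This is not the AdaLiftOver implementation or its R API: the `Interval` type, the `extend` and `rank_candidates` helpers, the cosine and Jaccard similarities, and the linear `weight` are all hypothetical placeholders chosen for illustration. The candidate target regions are assumed to have already been produced by a liftOver-style mapping of the extended query.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A genomic interval (hypothetical helper type, 0-based coordinates)."""
    chrom: str
    start: int
    end: int


def extend(interval, flank):
    """Widen a query region by `flank` bp on each side, clamping at position 0."""
    return Interval(interval.chrom, max(0, interval.start - flank), interval.end + flank)


def cosine(u, v):
    """Cosine similarity between two equal-length signal vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def jaccard(a, b):
    """Jaccard similarity between two sets of motif names (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(query_epi, query_motifs, candidates, weight=0.5):
    """Rank candidate target regions by a combined similarity score.

    `candidates` is a list of (Interval, epigenomic_vector, motif_set) tuples.
    The linear combination below is an assumption made for this sketch only.
    """
    scored = []
    for interval, epi, motifs in candidates:
        score = weight * cosine(query_epi, epi) + (1 - weight) * jaccard(query_motifs, motifs)
        scored.append((score, interval))
    # Highest-scoring candidate first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch, a candidate whose epigenomic signal and motif content both match the query ranks above one that matches neither; how the two feature types are actually combined and thresholded is determined by the tool itself, not by this example.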