Long read based sequencing software for the comprehensive analysis of clinical samples

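Basic Information

Grant number:
10009727
Principal investigator:
TIMOTHY J DURFEE
Award amount:
$750,000
Host institution:
DNASTAR, INC.
Host institution country:
United States
Project category:
Fiscal year:
2020
Funding country:
United States
Project period:
2020-04-01 to 2022-03-31
Project status:
Completed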

Source:
https://reporter.nih.gov/project-details/10009727
Keywords:
Biological Sciences Candidate Disease Gene Catalogs Clinical Collaborations Computer software Computers DNA DNA Resequencing Data Data Set Detection Development Disease Ensure Environment Evaluation Feedback Gene Family Genes Genome Genomic DNA Goals Haplotypes Heterozygote Hour Individual Kidney Diseases Laboratories Length Methods Nature Phase Polishes Population Privacy Process Provider Pseudogenes Research Personnel Running Sampling Series Statistical Methods Stream Structure Targeted Resequencing Technology Time Variant Visualization analytical tool annotation system base clinical application clinical sequencing cohort contig cost cost effective disorder prevention genetic variant genome browser genome sequencing genome-wide improved insertion/deletion mutation instrument interest nanopore next generation sequencing novel parallel processing prototype reference genome software development tool trait whole genome

Project Abstract

The high cost and complexity of analyzing whole genome resequencing data remain prohibitive for most clinical applications. Targeted resequencing allows regions of interest to be enriched from a genomic DNA sample and sequenced to high depth, allowing cost-effective identification of important variants. In combination with next-generation sequencing (NGS), the approach has been exploited to tremendous effect in identifying candidate genes and variants for an array of diseases and traits from cohorts and populations as well as individual clinical samples. However, the short-read nature of NGS technologies severely limits their potential to characterize, for example, compound heterozygotes and structural variants (SV), because short reads lack the long-range connectivity needed for haplotype phasing and SV detection. Those limitations can be overcome with long-read data from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). Moreover, new targeting methods tailored toward long-read sequencing are being developed such that a comprehensive analysis of key regions in an individual’s genome will soon be within reach. However, an integrated software solution that is easy enough for clinical researchers to use efficiently is sorely lacking.

The overall goal of this Direct to Phase II proposal is to develop commercial-grade software that produces a comprehensive catalog of annotated, haplotype-phased variants from clinical sequencing data and presents them to clinical researchers through a single easy-to-use application with both analytical and genome browsing capabilities, GenVision Ultra. The proposal focuses on augmenting our highly extensible XNG assembly pipeline with the tools necessary for fully automated detection and annotation of all classes of variants from haplotype-phased sequences. Novel adaptations to core XNG components will partition reads matching the reference from those likely representing an SV for parallel processing (Aim 1). Matching reads will be aligned to the reference using XNG, while the putative SV-containing reads will be de novo assembled and annotated using our long read assembler (LRA). Reference-based alignments will be phased using a novel Bayesian classifier to produce two haplotype sequences prior to SNV/small indel calling and annotation (Aim 2). Short-read polishing of the entire assembly will be available on demand. Complete small variant and SV profiles, as well as the underlying assembly data, will be accessible to the end user in GenVision Ultra. In addition, the application will have discrete filtering and statistical tools with which to identify genes and/or variants of interest in an individual sample or across a cohort/population (Aim 3).

To ensure that the software meets the needs of the clinical sequencing market, Arkana Laboratories has agreed to provide ONT and Illumina sequence data from highly curated HapMap control samples processed with their kidney disease gene panels. Those real-world data sets, together with expert interpretation and feedback by Arkana researchers, provide an ideal environment in which to develop an outstanding software solution for this critical market (Aim 4).
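
The abstract does not describe how the Bayesian classifier of Aim 2 works, so the following is only a toy illustration of the general idea of assigning long reads to one of two haplotypes by posterior probability. It is not the XNG or GenVision Ultra method. The function names, the uniform per-base error model, the `error_rate` and `threshold` defaults, and the input layout (dictionaries mapping heterozygous site positions to bases) are all assumptions made for this sketch.

```python
from math import exp, log


def haplotype_posterior(read_alleles, hap_a, hap_b, error_rate=0.05, prior_a=0.5):
    """Return P(read came from haplotype A | observed alleles).

    read_alleles: dict of {site position: base observed in the read}
    hap_a, hap_b: dict of {site position: base carried by that haplotype}
    error_rate:   assumed probability that an observed base is wrong
    """
    log_a = log(prior_a)
    log_b = log(1.0 - prior_a)
    for pos, base in read_alleles.items():
        # Only sites defined on both haplotypes carry phasing information.
        if pos not in hap_a or pos not in hap_b:
            continue
        match = log(1.0 - error_rate)
        mismatch = log(error_rate / 3.0)  # error spread over the 3 other bases
        log_a += match if base == hap_a[pos] else mismatch
        log_b += match if base == hap_b[pos] else mismatch
    # Normalize in log space to avoid underflow on reads spanning many sites.
    top = max(log_a, log_b)
    weight_a = exp(log_a - top)
    weight_b = exp(log_b - top)
    return weight_a / (weight_a + weight_b)


def assign_reads(reads, hap_a, hap_b, threshold=0.9, error_rate=0.05):
    """Bin reads into haplotype A, haplotype B, or 'unassigned'."""
    bins = {"A": [], "B": [], "unassigned": []}
    for name, alleles in reads.items():
        p_a = haplotype_posterior(alleles, hap_a, hap_b, error_rate=error_rate)
        if p_a >= threshold:
            bins["A"].append(name)
        elif p_a <= 1.0 - threshold:
            bins["B"].append(name)
        else:
            bins["unassigned"].append(name)
    return bins
```

A read that supports each haplotype equally, or that covers no informative site, keeps a posterior of 0.5 and stays unassigned. This is also why long reads help: the more heterozygous sites a single read spans, the more decisive its posterior becomes.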
