Scalable detection and interpretation of structural variation in human genomes


Basic information

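Grant number:
9973582
Principal investigator:
Aaron R Quinlan
Amount:
About $692,000
Host institution:
UNIVERSITY OF UTAH
Institution country:
United States
Project category:
Fiscal year:
2020
Funding country:
United States
Project period:
2020-05-01 to 2024-02-29
Project status:
Completed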

Source:
https://reporter.nih.gov/project-details/9973582
Keywords:
Acute; Affect; Algorithmic Software; Algorithms; All of Us Research Program; Alleles; Area; Automobile Driving; Biological Assay; Chromatin Structure; Chromosome Structures; Clip; Cloud Computing; Code; Communities; Complex; Computer software; Copy Number Polymorphism; DNA; DNA Sequence; Data; Data Reporting; Detection; Development; Disease; Environment; Error Sources; Exhibits; Family Study; Funding; Future; Gene Duplication; Gene Expression; Gene Fusion; Gene Structure; Genetic; Genetic Diseases; Genetic Variation; Genome; Genomics; Genotype; Goals; Human; Human Genome; Individual; Laboratories; Large-Scale Sequencing; Location; Machine Learning; Maps; Methods; Modeling; Noise; Nucleotides; Paint; Pathogenicity; Performance; Phenotype; Population; Positioning Attribute; Prevalence; Process; Reciprocal Translocation; Research; Running; Sampling; Sea; Sensitivity and Specificity; Sequence Alignment; Series; Signal Transduction; Software Tools; Source; Speed; Structure; Systematic Bias; Techniques; Technology; Training; Trans-Omics for Precision Medicine; United States National Institutes of Health; Untranslated RNA; Variant; algorithm development; base; convolutional neural network; deep learning; developmental disease; dosage; exome; experience; genome analysis; genome sequencing; genome-wide; human disease; improved; innovation; insertion/deletion mutation; insight; large datasets; method development; nanopore; novel; prevent; research and development; software development; success; tool; variant detection; whole genome

Project summary

Structural variation (SV) is a diverse class of genome variation that includes copy number variants (CNVs), such as deletions and duplications, as well as balanced rearrangements, such as inversions and reciprocal translocations. A typical human genome harbors more than 4,000 SVs larger than 300 bp, and their large size increases the potential to delete or duplicate genes, disrupt chromatin structure, and alter expression. Despite their prevalence and potential for phenotypic consequence, SVs remain notoriously difficult to detect and genotype with high accuracy. Much of this difficulty is driven by the fact that the DNA sequence alignment “signals” indicating SVs are far more complex than those for single-nucleotide and insertion-deletion variants. Unlike SNP alignments, which vary only in allele state, alignments supporting SVs vary in state (whether or not they support an alternate structure), alignment location, and type. Consequently, the accuracy of SV discovery is much lower than that of SNP and INDEL discovery. Furthermore, SV pipelines scale poorly and are difficult to run. These challenges are a barrier for single-genome analysis, and studies of families must invest substantial effort into eliminating a sea of false positives. These problems become far more acute for large-scale sequencing efforts such as TOPMed, the Centers for Common Disease Genetics, and the All of Us program. Software efficiency is key to scalability for such projects. However, comprehensive, accurate discovery is of equal importance. Building upon more than a decade of experience developing software and analyzing SVs in diverse disease contexts, we have invested significant effort into understanding the causes of insufficient accuracy in SV discovery. These efforts, together with our research and development experience in this area, give us unique insight into improving the accuracy and scalability of SV discovery.
Our goal is to narrow the accuracy gap between SNP/INDEL discovery and structural variation discovery. These developments will empower studies of human genomes in diverse contexts and will therefore have broad impact. Our goals are to:
1. Develop a deep learning model to correct systematic variation in sequence depth. This new machine learning model will correct systematic biases in DNA sequence depth and dramatically improve the discovery of deletions and duplications.
2. Improve the speed, scalability, and accuracy of SV detection and genotyping. Using new algorithms, we will bring the accuracy of SV detection much closer to that of SNP and INDEL discovery and allow accurate SV discovery to be deployed at scale.
3. Create a map of genomic constraint for SV from population-scale genome analysis. We will deploy our new methods to detect and genotype structural variation among tens of thousands of human genomes. The resulting SV map will empower the creation of a model of genomic constraint for SV and enable new software to predict deleterious SVs, especially in the noncoding genome.
