Machine learning approaches for improved accuracy and speed in sequence annotation

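Basic information

Grant number:
10020995
Principal investigator:
Travis John Wheeler
Amount:
$287,500
Host institution:
UNIVERSITY OF MONTANA
Host institution country:
United States
Project category:
Fiscal year:
2019
Funding country:
United States
Project period:
2019-09-20 to 2023-07-31
Project status:
Completed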

Source:
https://reporter.nih.gov/project-details/10020995
Keywords:
Address Algorithms Architecture Bioinformatics Biological Classification Collection Communities Complex Computer Vision Systems Computer software Consumption Custom DNA Transposable Elements Data Set Deletion Mutation Descriptor Development Error Sources Evolution Foundations Genome Genomics Hour Human Human Genome Industry Standard Insertion Mutation Institutes Intervention Joints Label Letters Licensing Machine Learning Manuals Masks Methods Modeling Modernization Molecular Biology Network-based Nucleotides Pattern Pilot Projects Proteins Repetitive Sequence Sequence Alignment Sequence Analysis Source Speed Statistical Models Takifugu Work annotation system artificial neural network base bioinformatics tool computing resources convolutional neural network deep learning density design genomic data improved markov model neural network architecture novel novel strategies open source software development statistics success tool

Project abstract

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe machine learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and artificial neural networks to dramatically reduce these kinds of sequence annotation error. We also address annotation speed by developing a custom deep learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence alignment step. The results of these efforts will be incorporated into forks of the open source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. Developing these methods and incorporating them into these widely used bioinformatics tools will lead to widespread impact on sequence annotation efforts.
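
The filter-then-align idea in the abstract can be illustrated with a minimal sketch. This is not the project's method: the abstract describes a custom deep learning filter, whereas the sketch below swaps in a simple shared k-mer count as a stand-in, only to show how a cheap filter can discard candidate comparisons before a slower alignment step. All function names, parameters, and scoring values here are hypothetical and chosen for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_filter(query, target, k=3, min_shared=2):
    """Cheap stand-in for a learned filter: require a few shared k-mers.

    Assumption: the real project uses a neural network here, not k-mers.
    """
    return len(kmers(query, k) & kmers(target, k)) >= min_shared


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    This plays the role of the relatively slow alignment step.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, targets, k=3, min_shared=2):
    """Align only the targets that survive the filter; return {name: score}."""
    return {
        name: local_alignment_score(query, seq)
        for name, seq in targets.items()
        if passes_filter(query, seq, k, min_shared)
    }


if __name__ == "__main__":
    candidates = {"hit": "TTACGTACGTTT", "miss": "GGGGGGGGGG"}
    print(search("ACGTACGT", candidates))
```

In this toy setup the quadratic-time alignment runs only for candidates that share at least `min_shared` k-mers with the query. A filter of this kind trades some sensitivity for speed, so its threshold would need tuning against the alignment step it protects.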
