Finding Protein Sequence Motifs: Methods and Applications

Basic information

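Grant number:
9555730
Principal investigator:
Eugene V Koonin
Amount:
$319,100
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Not concluded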

Source:
https://reporter.nih.gov/project-details/9555730
Keywords:
Actinomyces Infections; Amino Acid Motifs; Amino Acid Sequence; Animals; Antitumor Response; Apoptosis; Archaeal Genome; Architecture; Bacteria; Bacterial Genome; Biological; Capsid Proteins; Censuses; Classification; Clustered Regularly Interspaced Short Palindromic Repeats; Collection; Complex; Custom; DNA; Death Domain; Development; Disease; Dissection; Eukaryota; Evolution; Family; Family member; Generations; Genes; Genome; Genome engineering; Genomics; Goals; Homology Modeling; Human; Individual; Investigation; Lead; Libraries; Life; Methodology; Methods; Mobile Genetic Elements; Nomenclature; Organism; Pattern; Periodicity; Phenotype; Planet Earth; Positioning Attribute; Process; Prokaryotic Cells; Property; Protein Analysis; Protein Family; Protein Structure Initiative; Proteins; RNA Binding; Recruitment Activity; Regulation; Research; Route; SAM Domain; Signal Transduction; Structure; System; Tertiary Protein Structure; Variant; Viral; Viral Genome; Virion; Virus; Work; adaptive immunity; database structure; design; exhaustion; experience; genetic element; genome editing; markov model; microbial; molecular sequence database; novel; nucleoside triphosphatase; polymerization; protein profiling; protein structure; sample fixation; tool; trait

Project abstract

The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, Hidden Markov Models (HMMs), protein profile-against-profile comparison as implemented in the HHSearch method, protein structure comparison methods, homology modeling of protein structure, and genome context analysis were applied extensively and increasingly. Furthermore, custom libraries of protein domain profiles and computational pipelines for novel domain identification have been developed and applied. The research performed over the last year has led to further progress in the study of the classification, evolution, and functions of several classes of proteins and domains. In particular, we have performed a comprehensive analysis of the relationships among viral capsid proteins. Viruses are the most abundant biological entities on Earth and show remarkable diversity of genome sequences, replication and expression strategies, and virion structures. Evolutionary genomics of viruses has revealed many unexpected connections, but the general scenario(s) for the evolution of the virosphere remains a matter of intense debate among proponents of the cellular regression, escaped genes, and primordial virus world hypotheses. A comprehensive sequence and structure analysis of major virion proteins indicates that they evolved on about 20 independent occasions, and in some of these cases likely ancestors are identifiable among the proteins of cellular organisms. Virus genomes typically consist of distinct structural and replication modules that recombine frequently and can have different evolutionary trajectories.
The results of this analysis suggest that, although the replication modules of at least some classes of viruses might descend from primordial selfish genetic elements, bona fide viruses evolved on multiple, independent occasions throughout the course of evolution by the recruitment of diverse host proteins that became major virion components. In another project, we performed a detailed analysis and classification of the protein domains that comprise the class 2 CRISPR-Cas systems, the microbial defense machinery that has recently been exploited for the development of a new generation of genome editing tools. Class 2 CRISPR-Cas systems are characterized by effector modules that consist of a single multidomain protein, such as Cas9 or Cpf1. We designed a computational pipeline for the discovery of novel class 2 variants and used it to identify six new CRISPR-Cas subtypes. The diverse properties of these new systems provide potential for the development of versatile tools for genome editing and regulation. We performed a comprehensive census of class 2 types and subtypes in complete and draft bacterial and archaeal genomes, outlined evolutionary scenarios for the independent origin of different class 2 CRISPR-Cas systems from mobile genetic elements, and proposed an amended classification and nomenclature of CRISPR-Cas. In a separate development, we performed an exhaustive computational dissection of the domain architecture of the SAMD9 family proteins that are involved in antiviral and antitumor responses in humans. We show that the SAMD9 protein family is represented in most animals and also, unexpectedly, in bacteria, in particular actinomycetes. From the N to the C terminus, the core SAMD9 family architecture includes a DNA/RNA-binding AlbA domain, a variant Sir2-like domain, a STAND-like P-loop NTPase, an array of TPR repeats, and an OB-fold domain with predicted RNA-binding properties.
Vertebrate SAMD9 family proteins contain the eponymous SAM domain capable of polymerization, whereas some family members from other animals instead contain homotypic adaptor domains of the DEATH superfamily, known as dedicated components of apoptosis networks. Such complex domain architecture is reminiscent of the STAND superfamily NTPases that are involved in various signaling processes, including programmed cell death, in both eukaryotes and prokaryotes. These findings suggest that SAMD9 is a hub of a novel, evolutionarily conserved defense network that remains to be characterized. In a more theoretically oriented project, we performed a genomic census and evolutionary analysis of repeat arrays in diverse protein families. Protein repeats are considered hotspots of protein evolution, associated with acquisition of new functions and novel phenotypic traits, including disease. Paradoxically, however, repeats are often strongly conserved through long spans of evolution. To resolve this conundrum, it is necessary to directly compare paralogous (horizontal) evolution of repeats within proteins with their orthologous (vertical) evolution through speciation. We developed a rigorous methodology to identify highly periodic repeats with significant sequence similarity, for which evolutionary rates and selection (dN/dS) can be estimated, and systematically characterized their evolution. We showed that horizontal evolution of repeats is markedly accelerated compared with their divergence from orthologues in closely related species. This observation is universal across the diversity of life forms and implies a biphasic evolutionary regime whereby new copies experience rapid functional divergence under the combined effects of strongly relaxed purifying selection and positive selection, followed by fixation and conservation of each individual repeat.
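The horizontal-versus-vertical comparison of repeats described above can be illustrated with a minimal sketch. It is not the methodology of the project: it replaces dN/dS estimation with a raw p-distance (the fraction of differing positions), assumes that the repeat units have already been identified and aligned to equal length, and uses invented toy sequences. The function names `p_distance`, `horizontal_divergence`, and `vertical_divergence` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and pre-aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def horizontal_divergence(repeats: list[str]) -> float:
    """Mean pairwise distance among repeat units within one protein (paralogous).

    Needs at least two repeat units; otherwise statistics.mean raises an error.
    """
    return mean(p_distance(a, b) for a, b in combinations(repeats, 2))


def vertical_divergence(repeats_a: list[str], repeats_b: list[str]) -> float:
    """Mean distance between positionally matching repeats of two orthologs."""
    if len(repeats_a) != len(repeats_b):
        raise ValueError("orthologs must carry the same number of repeat units")
    return mean(p_distance(a, b) for a, b in zip(repeats_a, repeats_b))


# Invented toy data: three repeat units in each of two closely related species.
species_a = ["ACDEFGHIKL", "ACDQFGHIRL", "TCDEFAHIKL"]
species_b = ["ACDEFGHIKL", "ACDQFGHIRM", "TCDEFAHIKL"]

horizontal = horizontal_divergence(species_a)
vertical = vertical_divergence(species_a, species_b)
print(f"horizontal (within protein): {horizontal:.3f}")
print(f"vertical (between orthologs): {vertical:.3f}")
```

In this toy case the repeats differ more from each other within a protein than from their counterparts in the other species, which is the pattern the abstract reports. A real analysis would work on codon alignments and estimate synonymous and nonsynonymous rates separately.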
Taken together, these studies expand the known repertoire of protein domains with defined functions and lead to the discovery of novel, biologically important functional systems in diverse organisms, some of which are expected to have practical implications, e.g., in genome engineering. The findings also contribute to the current understanding of the routes of protein evolution.
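The profile-based search methods named at the start of the abstract share one core idea: a motif is represented by position-specific scores rather than by a single sequence. The sketch below illustrates that idea only. It is not the PSI-BLAST algorithm, which among other things weights sequences and uses data-dependent pseudocounts. It assumes a gapless alignment, a uniform background, and a flat pseudocount. The toy alignment and the function names `build_pssm`, `score_window`, and `best_hit` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-column log2-odds scores against a uniform background (simplifying assumption)."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length, gapless rows")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm: list[dict[str, float]], window: str) -> float:
    """Sum of column scores; raises KeyError for non-standard residues."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm: list[dict[str, float]], sequence: str) -> tuple[int, float]:
    """Return (start index, score) of the highest-scoring window in the sequence."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    scored = [
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, score


# Invented toy alignment of a short glycine-rich motif.
toy_alignment = ["GAGKST", "GSGKST", "GTGKTT", "GPGKSS"]
profile = build_pssm(toy_alignment)
start, score = best_hit(profile, "MKLVGSGKSTLLA")
print(f"best window starts at index {start}")
```

An iterative search in the style of PSI-BLAST would add the significant hits to the alignment, rebuild the profile, and search again until no new sequences are found.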
