Statistics of Sequence Comparison

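Basic Information

Grant number:
10007519
Principal investigator:
STEPHEN F ALTSCHUL
Amount:
$235,300
Institution:
NATIONAL LIBRARY OF MEDICINE
Institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Not concluded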

Project Abstract

The current direction of this project, pursued in collaboration with Dr. Andrew Neuwald of the Institute for Genome Sciences and the Department of Biochemistry & Molecular Biology at the University of Maryland School of Medicine, continued throughout this year. Previous focuses had been the development of an improved method for multiple alignment that could identify the common elements shared by large and diverse protein superfamilies, and the extension of this method to a hierarchical multiple alignment model. Such a model is based on the fact that large protein superfamilies have frequently diversified to fulfill distinct functional roles within different subfamilies. Each subfamily has distinct structural constraints, which yield distinct amino acid frequency vectors at particular positions characteristic of that subfamily. Although the amino acids at different positions may be independent within a subfamily, the changes in frequency vectors across multiple positions characteristic of each subfamily yield the appearance of correlation between positions when a simple, non-hierarchical model of a superfamily is constructed. Earlier approaches have modeled these apparent correlations directly, using pairwise coupling terms, but we model them by constructing an explicit hierarchical model, with individual sequences assigned to distinct nodes within the hierarchy. We applied the Minimum Description Length principle to ensure that the hierarchical models we construct do not overfit the data but have statistical support. This year the central focus of this project was the statistical assessment of the three-dimensional clustering of "distinguished positions", identified as characteristic of various nodes in a hierarchy. Our approach, called Initial Cluster Analysis (ICA), seeks to determine whether a set of distinguished elements within a linear array is clustered significantly near the start of the array and, if so, what the most significant initial cluster of these elements is.
Abstractly, given a linear array of length L containing D '1's (the distinguished elements) and L-D '0's, ICA considers a generative model in which the '1's occur with particular and differing probabilities before and after a cut point X in the array. For any particular X it is relatively easy to calculate a likelihood Like(X) of the array of data, and one may optimize Like(X) by simply evaluating it for all possible X. However, the values of Like(X) for nearby values of X are highly correlated, in a manner dependent upon a calculable "density of independent trials" Rho(X). Because Rho(X) is not constant but rather grows approximately as the reciprocal of X's distance from 0 or L, simply optimizing Like(X) inherently favors, a priori, small or large values of X. Therefore, if one's application suggests no such bias, choosing to optimize Like(X)/Rho(X) rather than Like(X) for a given array of '0's and '1's may be a better strategy; we refer to this approach as using "flattened priors". ICA estimates the effective total number of independent trials implicit in either optimization, which it uses in calculating a p-value for the optimal X. This provides a mathematically principled way to define an optimal initial cluster of distinguished elements, balancing the claims of very short and dense clusters against those of longer but sparser clusters. We published ICA in the Journal of Computational Biology. To analyze real proteins using ICA, we ordered the residues within a protein by their physical distance from a point of reference, and used our previously developed hierarchical analysis to define a set of distinguished residues characteristic of a protein family or subfamily. ICA then allows us to find sets of distinguished residues that are significantly clustered in three dimensions.
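The cut-point scan described above can be illustrated with a minimal Python sketch. It is not the published ICA implementation. The function names (`order_by_distance`, `scan_cut_points`), the use of maximum-likelihood rates on each side of the cut, and in particular the stand-in density `1/X + 1/(L-X)` are assumptions made for illustration only; the abstract states just that Rho(X) grows approximately as the reciprocal of X's distance from 0 or L. The p-value step, which requires the effective number of independent trials, is not attempted here.

```python
import math

def order_by_distance(coords, reference, distinguished):
    """Build the 0/1 array: residues sorted by distance from a reference point.

    coords: list of (x, y, z) tuples, one per residue (hypothetical input).
    distinguished: set of residue indices flagged as distinguished.
    """
    order = sorted(range(len(coords)),
                   key=lambda i: math.dist(coords[i], reference))
    return [1 if i in distinguished else 0 for i in order]

def log_lik(k, n):
    """Maximized Bernoulli log-likelihood of k ones in n trials."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def scan_cut_points(bits, flatten=False):
    """Evaluate every cut point X = 1 .. L-1 and return (best_X, best_score).

    The score is the log-likelihood ratio of a two-rate model (one rate of
    '1's before X, another after) against a single-rate model. Only cuts
    with a denser initial segment are considered. Returns (None, -inf)
    if no cut qualifies.
    """
    L, D = len(bits), sum(bits)
    null = log_lik(D, L)
    best_x, best_score = None, -math.inf
    ones_before = 0
    for x in range(1, L):
        ones_before += bits[x - 1]
        ones_after = D - ones_before
        if ones_before * (L - x) <= ones_after * x:
            continue  # initial segment is not denser than the remainder
        score = log_lik(ones_before, x) + log_lik(ones_after, L - x) - null
        if flatten:
            # ASSUMPTION: illustrative stand-in for Rho(X), not the
            # published formula. Dividing Like by Rho becomes a
            # subtraction in log space.
            score -= math.log(1.0 / x + 1.0 / (L - x))
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Toy example: four distinguished elements packed at the start of the array.
toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(scan_cut_points(toy))
print(scan_cut_points(toy, flatten=True))
```

Working in log space keeps the "flattened priors" variant a one-line change, which makes it easy to compare the cut point chosen with and without the correction on the same array.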
Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with the biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. This work was published in eLife. We initiated work on a new project to summarize and analyze the constraints on protein sequence and structure that may be derived from large multiple sequence alignments. For a particular protein, these constraints include those on amino acid usage at particular positions due to the protein's subfamily function, as well as those characteristic of the family and superfamily of which the protein is a member. Additional constraints, which may be derived from direct coupling analysis (DCA), are due to internal or heterodimeric pairwise interactions between different protein positions. The integrated analysis of these various constraints can suggest new lines of experimentation.