Statistics of Sequence Comparison
Basic information
- Grant number: 10007519
- Principal investigator:
- Amount: $235,300
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year:
- Funding country: United States
- Project period: to
- Project status: Not concluded
- Source:
- Keywords: Acetyltransferase; Amino Acid Sequence; Amino Acids; Appearance; Biochemical; Biochemistry; Biological; Characteristics; Cluster Analysis; Collaborations; Computational Biology; Coupling; DNA Sequence; Data; Development; Dimensions; Elements; Equilibrium; Family Characteristics; Frequencies; Goals; Guanosine Triphosphate Phosphohydrolases; Individual; Institutes; Journals; Length; Maryland; Mathematics; Methods; Modeling; Molecular Biology; Pattern; Phosphoric Monoester Hydrolases; Positioning Attribute; Probability; Protein Family; Proteins; Publishing; RNA Helicase; Role; Sequence Alignment; Structural Protein; Structure; System; Thymine; Universities; Work; base; density; genome sciences; improved; medical schools; member; nuclease; protein structure; rho; statistics; synaptojanin; uracil-DNA glycosylase; vector
Project abstract
The current direction of this project, in collaboration with Dr.
Andrew Neuwald of the Institute for Genome Sciences and Department
of Biochemistry & Molecular Biology at the University of Maryland
School of Medicine, continued throughout this year. Previous
focuses had been the development of an improved method for multiple
alignment that could identify the common elements shared by large
and diverse protein superfamilies, and the extension of this method
to a hierarchical multiple alignment model. Such a model is based
on the fact that large protein superfamilies frequently have
diversified to fulfill distinct functional roles within different
subfamilies. Each subfamily has distinct structural constraints,
which yield distinct amino acid frequency vectors at particular
positions characteristic of that subfamily. Although, within a
subfamily, the amino acids at different positions may be independent,
the changes in frequency vectors across multiple positions
characteristic of each subfamily yield the appearance of
correlation between positions when a simple, non-hierarchical
model of a superfamily is constructed. Earlier approaches have
modeled these apparent correlations directly, using pairwise
coupling terms, but we model them by constructing an explicit
hierarchical model, with individual sequences assigned to distinct
nodes within the hierarchy. We applied the Minimum Description
Length principle to ensure that the hierarchical models we
construct do not overfit the data, but have statistical support.
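
The way pooling subfamilies creates apparent correlation can be checked with a small exact calculation. The sketch below is illustrative only and is not the project's hierarchical model: it assumes two positions, each reduced to a binary indicator (a given amino acid present or absent), and the function name, subfamily weights, and frequencies are all hypothetical.

```python
def pooled_covariance(weights, freq_pos1, freq_pos2):
    """Covariance between two position indicators after pooling subfamilies.

    Within each subfamily the two positions are assumed independent, so the
    joint frequency inside subfamily s is freq_pos1[s] * freq_pos2[s].
    weights[s] is the fraction of sequences belonging to subfamily s.
    """
    joint = sum(w * a * b for w, a, b in zip(weights, freq_pos1, freq_pos2))
    mean1 = sum(w * a for w, a in zip(weights, freq_pos1))
    mean2 = sum(w * b for w, b in zip(weights, freq_pos2))
    return joint - mean1 * mean2


# Hypothetical numbers: two equally sized subfamilies with opposite preferences.
mixed = pooled_covariance([0.5, 0.5], [0.9, 0.1], [0.9, 0.1])
single = pooled_covariance([1.0], [0.9], [0.9])
print(mixed, single)
```

With these assumed frequencies the pooled covariance is clearly positive, while a single subfamily on its own gives zero, which is the effect a non-hierarchical model would read as coupling between positions.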
This year the central focus of this project was the statistical
assessment of the three-dimensional clustering of "distinguished
positions", identified as characteristic of various nodes in
a hierarchy. Our approach, called Initial Cluster Analysis (ICA),
seeks to determine whether a set of distinguished elements within
a linear array is clustered significantly near the start of the
array and, if so, what is the most significant initial cluster
of these elements. Abstractly, given a linear array of length L
containing D '1's (the distinguished elements) and L-D '0's,
it considers a generative model in which the '1's occur
with particular and differing probabilities before and after a
cut point X in the array. For any particular X it is relatively
easy to calculate a likelihood Like(X) of the array of data,
and one may optimize Like(X) by simply evaluating it for all
possible X. However, the values of Like(X) for close values
of X are highly correlated, dependent upon a calculable "density
of independent trials" Rho(X). Because Rho(X) is not constant
but rather grows approximately as the reciprocal of X's distance
from 0 or L, simply optimizing Like(X) inherently favors, a priori,
small or large values of X. Therefore, if one's application
suggests no such bias, choosing to optimize Like(X)/Rho(X) rather
than Like(X) for a given array of '0's and '1's may be a better
strategy; we refer to this approach as using "flattened priors".
ICA estimates the effective total number of independent trials
implicit in either optimization, which it uses in calculating
a p-value for the optimal X. This provides a mathematically
principled way to define an optimal initial cluster of
distinguished elements, balancing the claims of very short
and dense clusters with those of longer but sparser clusters.
We published ICA in the Journal of Computational Biology.
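
The exhaustive scan over cut points described above can be sketched as follows. This is a minimal illustration, not the published ICA procedure: it assumes the '1' probabilities before and after the cut are set to their maximum-likelihood values (the observed fractions), works with log-likelihoods, and omits both Rho(X) and the p-value calculation, whose exact forms are not given here. The names `log_like` and `best_cut` are hypothetical.

```python
import math


def _xlogy(count, prob):
    # count * log(prob), with the usual convention that 0 * log(0) = 0.
    return 0.0 if count == 0 else count * math.log(prob)


def log_like(bits, x):
    """Log-likelihood of a 0/1 array under a two-rate model with cut point x.

    Positions before x hold '1' with probability p, the rest with probability
    q; p and q are assumed to be the observed fractions on each side.
    """
    length, total_ones = len(bits), sum(bits)
    ones_before = sum(bits[:x])
    ones_after = total_ones - ones_before
    p = ones_before / x
    q = ones_after / (length - x)
    return (_xlogy(ones_before, p) + _xlogy(x - ones_before, 1 - p)
            + _xlogy(ones_after, q) + _xlogy(length - x - ones_after, 1 - q))


def best_cut(bits):
    # Evaluate every cut that leaves at least one element on each side.
    return max(range(1, len(bits)), key=lambda x: log_like(bits, x))


example = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(best_cut(example), log_like(example, best_cut(example)))
```

Recomputing the prefix sum for each cut makes this quadratic in the array length; a running count would make it linear. A flattened-prior variant would compare Like(X)/Rho(X) instead of Like(X) at each cut.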
To analyze real proteins using ICA, we ordered the residues within
a protein by their physical distance from a point of reference,
and used our previously-developed hierarchical analysis to define
a set of distinguished residues, characteristic of a protein family
or subfamily. ICA then allows us to find sets of distinguished
residues that are significantly clustered in three dimensions.
Applying this approach to N-acetyltransferases, P-loop GTPases,
RNA helicases, synaptojanin-superfamily phosphatases and nucleases,
and thymine/uracil DNA glycosylases yielded results congruent with
biochemical understanding of these proteins, and also revealed
striking sequence-structural features overlooked by other methods.
This work was published in eLife.
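
The step that turns a three-dimensional structure into the linear array ICA operates on can be sketched as below. The details are assumptions for illustration: one coordinate per residue (for example a C-alpha position), a caller-supplied reference point, and a hypothetical function name; the published analysis may choose these differently.

```python
import math


def distance_ordered_flags(coords, distinguished, reference):
    """Return a 0/1 list over residues sorted by distance from `reference`.

    coords: one (x, y, z) tuple per residue (assumed representative atom).
    distinguished: set of residue indices flagged by the hierarchical analysis.
    reference: (x, y, z) point from which distances are measured.
    """
    order = sorted(range(len(coords)),
                   key=lambda i: math.dist(coords[i], reference))
    return [1 if i in distinguished else 0 for i in order]


# Toy coordinates: residues 1 and 3 lie closest to the origin.
coords = [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(distance_ordered_flags(coords, {1, 3}, (0.0, 0.0, 0.0)))
```

The resulting array places the nearest residues first, so a significant initial cluster of '1's corresponds to distinguished residues concentrated around the reference point.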
We initiated work on a new project to summarize and analyze the
constraints on protein sequence and structure that may be derived
from large multiple sequence alignments. For a particular protein,
these constraints include those on amino acid usage in particular
positions due to the protein's subfamily function, as well as
those constraints characteristic of the family and superfamily
of which the protein is a member. Additional constraints, which
may be derived from direct coupling analysis (DCA), are due to internal or heterodimeric
pairwise interactions between different protein positions. The
integrated analysis of these various constraints can suggest new
lines for experimentation.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by STEPHEN F ALTSCHUL
Improvements And Extensions To The Blast Algorithms
- Grant number: 6546809
- Fiscal year:
- Funding amount: $235,300
- Project category:
Improvements And Extensions To The Blast Algorithms
- Grant number: 6843572
- Fiscal year:
- Funding amount: $235,300
- Project category:
IMPROVEMENTS AND EXTENSIONS TO THE BLAST ALGORITHMS
- Grant number: 6432754
- Fiscal year:
- Funding amount: $235,300
- Project category:
Improvements and Extensions to the BLAST Algorithms
- Grant number: 9555732
- Fiscal year:
- Funding amount: $235,300
- Project category:
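Similar NSFC (National Natural Science Foundation of China) grants
New enzyme design and molecular evolution of D-amino acid ammonia-lyase based on ancestral sequence reconstruction
- Grant number: 32271536
- Approval year: 2022
- Funding amount: CNY 540,000
- Project category: General Program
Templated cocrystal polymerization for the synthesis of high-molecular-weight sequenced poly(amino acid)s
- Grant number:
- Approval year: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Templated cocrystal polymerization for the synthesis of high-molecular-weight sequenced poly(amino acid)s
- Grant number: 22201105
- Approval year: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
New enzyme design and molecular evolution of D-amino acid ammonia-lyase based on ancestral sequence reconstruction
- Grant number:
- Approval year: 2022
- Funding amount: CNY 540,000
- Project category: General Program
Mechanism by which a C-terminal 40-amino-acid insertion sequence enhances the transcriptional efficiency of the bacterial fatty acid metabolism regulator FadR
- Grant number: 82003257
- Approval year: 2020
- Funding amount: CNY 240,000
- Project category: Young Scientists Fund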
Similar overseas grants
Enzymology of Bacteroides short and branched chain fatty acid metabolism
- Grant number: 10651505
- Fiscal year: 2023
- Funding amount: $235,300
- Project category:
BRD2-MULTIPROTEIN COMPLEXES IN MAMMALIAN CELL CYCLE TRANSCRIPTIONAL CONTROL
- Grant number: 8170865
- Fiscal year: 2010
- Funding amount: $235,300
- Project category:
BRD2-MULTIPROTEIN COMPLEXES IN MAMMALIAN CELL CYCLE TRANSCRIPTIONAL CONTROL
- Grant number: 7955890
- Fiscal year: 2009
- Funding amount: $235,300
- Project category:
Regulation and Gene Expression of Yeast Cytochrome c
- Grant number: 7926360
- Fiscal year: 2009
- Funding amount: $235,300
- Project category:
BRD2-MULTIPROTEIN COMPLEXES IN MAMMALIAN CELL CYCLE TRANSCRIPTIONAL CONTROL
- Grant number: 7722965
- Fiscal year: 2008
- Funding amount: $235,300
- Project category: