The identification of catalytic residues is an essential step in the functional characterization of enzymes. We present a purely structural approach to this problem, motivated by the difficulty that evolution-based methods have in annotating structural genomics targets with few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z-scoring. Special attention is paid to the class imbalance problem, which stems from the overwhelming number of non-catalytic residues in enzymes relative to catalytic residues. This problem is tackled by: 1) optimizing the classifier to maximize a performance criterion that considers both type I and type II errors in the classification of catalytic and non-catalytic residues; 2) under-sampling non-catalytic residues before SVM training; and 3) during SVM training, penalizing errors in learning catalytic residues more heavily than errors in learning non-catalytic residues. Tested on four enzyme datasets (one designed by us specifically to mimic the structural genomics scenario and three previously evaluated datasets), our structure-based classifier is never inferior to similar structure-based classifiers and is comparable to classifiers that use both structural and evolutionary features. In addition to evaluating the performance of catalytic residue identification, we present detailed case studies of three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, the database we designed, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.
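As a rough illustration of the ingredients named above, the following minimal Python sketch shows per-protein Z-scoring of a feature, spatial averaging over neighbouring residues, under-sampling of non-catalytic residues, and one possible criterion that accounts for both type I and type II errors. It is not the authors' implementation. The function names, the 10 Å radius, the 1:1 sampling ratio, and the choice of the Matthews correlation coefficient are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one feature across the residues of a single protein."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def spatial_average(coords, values, radius=10.0):
    """Average a feature over all residues within `radius` of each residue.

    `coords` holds one (x, y, z) point per residue; the residue itself
    is always included. The radius is a hypothetical choice.
    """
    averaged = []
    for ci in coords:
        near = [v for cj, v in zip(coords, values)
                if math.dist(ci, cj) <= radius]
        averaged.append(mean(near))
    return averaged


def under_sample(labels, ratio=1.0, seed=0):
    """Keep every catalytic residue (label 1) and a random subset of
    non-catalytic residues (label 0); return the kept indices, sorted."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), round(ratio * len(pos)))
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_keep))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, one criterion (assumed here)
    that penalizes both false positives and false negatives."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

The third measure, penalizing errors on catalytic residues more heavily during training, is normally supplied by the SVM library itself as per-class misclassification costs (for example, the `class_weight` parameter of scikit-learn's `SVC`); whether the authors used that library is not stated in the abstract.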