Developing computational methods for assigning protein function from tertiary structure is an important problem, and predicting a catalytic mechanism from structural information alone is particularly challenging. This work aims to clarify the molecular basis of catalysis by exploring the nature of catalytic residues, their environment and their characteristic properties in a large data set of enzyme structures, and by using this information to predict the active sites of enzyme structures. A machine learning approach that performs feature extraction, clustering and classification on a protein structure data set is proposed. The 6,376 residues directly involved in enzyme catalysis, present in more than 800 protein structures in the PDB, were analyzed. Feature extraction yielded a description of the critical features of each catalytic residue, and these features were consistent with prior knowledge of those residues. Results from k-fold cross-validation of the classifiers showed more than 80% accuracy. Complete enzymes were then scanned with these classifiers to locate catalytic residues.
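To make the evaluation step concrete, the following is a minimal sketch of k-fold cross-validation for a catalytic versus non-catalytic residue classifier. It is illustrative only: the abstract does not say which classifier or which features were used, so the nearest-centroid classifier, the three-dimensional synthetic feature vectors and every function name here (`k_fold_indices`, `nearest_centroid_fit`, `cross_validate`, `make_toy_data`) are assumptions, not the authors' method or data.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid_fit(X, y):
    """Hypothetical stand-in classifier: one mean vector per class label."""
    by_label = {}
    for row, label in zip(X, y):
        by_label.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_label.items()}

def nearest_centroid_predict(model, row):
    """Return the label whose centroid lies closest to the given row."""
    return min(model, key=lambda label: sq_dist(model[label], row))

def cross_validate(X, y, k=5, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

def make_toy_data(n_per_class=100, seed=1):
    """Synthetic residues (NOT real PDB data): label 1 = catalytic, 0 = other."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((1, 1.0), (0, -1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0) for _ in range(3)])
            y.append(label)
    return X, y

X, y = make_toy_data()
accuracy = cross_validate(X, y, k=5)
print(f"mean 5-fold accuracy: {accuracy:.2f}")
```

In a real setting the synthetic vectors would be replaced by features extracted from each residue and its structural environment, and scanning a complete enzyme would amount to calling the prediction function on every residue of the structure.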