Reducing the amino acid alphabet can significantly simplify protein complexity, which can improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although several reduced alphabets have been proposed, different classification rules can produce different results in protein sequence analysis. There is therefore a pressing need for a systematic framework for reduced alphabets. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for addressing specific protein problems. Users can easily select the desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool produces the K-tuple reduced amino acid composition from three correlation parameters (K-tuple, g-gap and λ-correlation). The results are visualized as a sequence alignment, the merging of RAA composition, the feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on the K-tuple RAAC. The optimal model can be selected according to evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook presents a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook.
Database URL: http://bioinfor.imu.edu.cn/raacbook
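To make the idea of a K-tuple reduced amino acid composition concrete, the following is a minimal Python sketch, not the RAACBook implementation. The six-cluster alphabet, the function names and the example sequence are all hypothetical and chosen only for illustration. The sketch assumes that the g-gap parameter is the number of positions skipped between consecutive residues of a tuple, and it leaves out the λ-correlation parameter, whose exact definition is not given here.

```python
from collections import Counter
from itertools import product

# Hypothetical 6-cluster alphabet for illustration only; not taken from RAACBook.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def build_mapping(clusters):
    """Map every residue to the first letter of its cluster."""
    return {aa: cluster[0] for cluster in clusters for aa in cluster}


def reduce_sequence(seq, mapping):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids raise a KeyError.
    """
    return "".join(mapping[aa] for aa in seq.upper())


def ktuple_composition(reduced, letters, k=2, gap=0):
    """Frequency of every K-tuple whose residues are `gap` positions apart.

    Assumption: `gap` counts the positions skipped between consecutive
    residues of a tuple, so gap=0 gives contiguous K-tuples.
    """
    step = gap + 1
    span = (k - 1) * step + 1  # sequence window covered by one tuple
    counts = Counter(
        reduced[i:i + span:step] for i in range(len(reduced) - span + 1)
    )
    total = sum(counts.values())
    keys = ("".join(p) for p in product(letters, repeat=k))
    return {key: (counts[key] / total if total else 0.0) for key in keys}


mapping = build_mapping(CLUSTERS)
letters = [cluster[0] for cluster in CLUSTERS]
reduced = reduce_sequence("MKVLAAGDE", mapping)
features = ktuple_composition(reduced, letters, k=2, gap=0)
```

With six clusters and K = 2, the feature vector has 36 entries instead of the 400 needed for the full 20-letter alphabet, which illustrates how alphabet reduction shrinks the feature space.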