Alignment-independent Classification of Proteins
Basic information
- Grant number: 7050072
- Principal investigator:
- Amount: $272,100
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2004
- Funding country: United States
- Project period: 2004-05-01 to 2008-04-30
- Project status: Completed
- Source:
- Keywords: Internet; classification; computer assisted sequence analysis; computer human interaction; computer program/software; computer system design/evaluation; molecular biology information system; molecular probes; protein localization; protein protein interaction; protein purification; protein sequence; protein structure function; technology/technique development
Project abstract
DESCRIPTION (provided by applicant): The Human Genome Project and related genome projects have stirred great hopes for improving our understanding and treatment of diseases. Central to this process is the automated detection of functional motifs and the classification of protein sequences into families and/or subfamilies. Conventional approaches to protein sequence classification usually employ sequence alignment methods; other methods depend on the choice of the features included in the training sets, and on the accuracy and availability of data. We propose an alignment-independent classification approach based on a search engine technology that has been used successfully to classify medical records. Each protein is represented by a multidimensional vector whose elements refer to the protein's most discriminative n-grams (sequences of n amino acids). Preliminary studies on G protein coupled receptors (GPCRs) showed that a simple Naive Bayes classifier using straightforward n-gram feature selection in its preprocessing can outperform existing classifiers, including support vector machines, on previously investigated, standardized GPCR sequence data subsets. Jackknife tests applied to the Protein Information Resource (PIR) Protein Sequence Database (PSD) and to the Pfam database (DB) of protein families showed that approximately 70% of the protein sequences are classified correctly. More significantly, the most discriminative n-grams in a given protein family appear to have a functional or structural role, as suggested by their comparison with the sequence motifs known to be conserved or active in existing DBs and by the examination of the three-dimensional structure of representative members of the family.
Encouraged by these results, we propose to pursue the following specific aims: (1) develop a new computational tool for protein sequence analysis and protein classification based on n-gram distributions, (2) build a comprehensive DB of protein families based on n-gram distributions and investigate the relationships between this DB and the leading protein classification DBs, (3) determine the functional significance of the top-ranking n-grams, and (4) develop a Java-based toolkit that will provide an easy-to-use yet flexible web interface to researchers from various backgrounds. The expected deliverables are the methodology and software for classification without alignment (CWA); a new database of classified proteins, based on CWA; and an online server and GUI that will deliver the database and data mining tools to the scientific community in a user-friendly environment.
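The abstract's core idea, representing each protein by counts of short amino-acid substrings and classifying it with Naive Bayes instead of aligning it, can be illustrated with a minimal sketch. This is not the project's software: the class name `NGramNaiveBayes`, the helper `ngrams`, the choice of n = 3, the add-one smoothing, and the toy sequences and family labels are all assumptions made for illustration. The sketch also uses every n-gram seen in training rather than selecting the most discriminative ones, which the abstract describes as a preprocessing step.

```python
import math
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return all overlapping n-grams (length-n substrings) of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NGramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()            # training sequences per family
        self.gram_counts = defaultdict(Counter)  # n-gram counts per family
        self.vocabulary = set()                  # every n-gram seen in training

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            grams = ngrams(seq, self.n)
            self.class_counts[label] += 1
            self.gram_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def log_scores(self, sequence):
        """Return the unnormalized log posterior of each family."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.gram_counts[label].values()) + vocab_size
            score = math.log(count / total)  # log prior of the family
            for gram in ngrams(sequence, self.n):
                if gram in self.vocabulary:  # skip n-grams unseen in training
                    score += math.log((self.gram_counts[label][gram] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)


# Toy, made-up sequences and family labels (not real proteins).
train_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSWLPGGSW", "GGSWMPGGSW"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

model = NGramNaiveBayes(n=3).fit(train_sequences, train_labels)
print(model.predict("AKTAYIAK"))
print(model.predict("LPGGSWGG"))
```

No alignment is computed at any point: a query is scored only by which n-grams it shares with each family, so the two toy queries are assigned to the families whose training sequences contain their 3-grams.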
Project outcomes
Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Ivet Bahar
Toward a deeper understanding of allostery and allotargeting by computational approaches
- Grant number: 10462594
- Fiscal year: 2021
- Funding amount: $272,100
- Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
- Grant number: 10231654
- Fiscal year: 2021
- Funding amount: $272,100
- Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
- Grant number: 10887238
- Fiscal year: 2021
- Funding amount: $272,100
- Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
- Grant number: 10612069
- Fiscal year: 2021
- Funding amount: $272,100
- Project category:
Structure and function of PTH class B GPCR
- Grant number: 10657916
- Fiscal year: 2018
- Funding amount: $272,100
- Project category:
NIDA Center of Excellence of Computational Drug Abuse Research (CDAR)
- Grant number: 8896676
- Fiscal year: 2014
- Funding amount: $272,100
- Project category:
NIDA Center of Excellence of Computational Drug Abuse Research (CDAR)
- Grant number: 8743368
- Fiscal year: 2014
- Funding amount: $272,100
- Project category:
Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data
- Grant number: 8935874
- Fiscal year: 2014
- Funding amount: $272,100
- Project category:
Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data
- Grant number: 9404096
- Fiscal year: 2014
- Funding amount: $272,100
- Project category:
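Similar National Natural Science Foundation of China (NSFC) grants
Taxonomic revision of Corydalis sect. Davidianae (南黄堇组)
- Grant number: 32300176
- Year approved: 2023
- Funding amount: 300,000 yuan
- Project category: Young Scientists Fund
Research on trustworthy deep learning classification methods for hyperspectral images
- Grant number: 62371169
- Year approved: 2023
- Funding amount: 500,000 yuan
- Project category: General Program
Classification of typical ground-object targets in Ningxia and models and algorithms for multi-source remote sensing image information processing
- Grant number: 42361056
- Year approved: 2023
- Funding amount: 330,000 yuan
- Project category: Regional Science Fund
Global classification of three-dimensional Lotka-Volterra systems with identical intrinsic growth rates
- Grant number: 12301221
- Year approved: 2023
- Funding amount: 300,000 yuan
- Project category: Young Scientists Fund
Classification problems in hyperplane arrangements
- Grant number: 12301424
- Year approved: 2023
- Funding amount: 300,000 yuan
- Project category: Young Scientists Fund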
Similar overseas grants
AMAUTA HEALTH INFORMATICS RESEARCH AND TRAINING PROGRAM
- Grant number: 7249492
- Fiscal year: 2004
- Funding amount: $272,100
- Project category:
Washington Obstetric-Fetal Pharmacology Research Unit
- Grant number: 7695403
- Fiscal year: 2004
- Funding amount: $272,100
- Project category: