Alignment-independent Classification of Proteins

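Basic information

  • Grant number:
    7050072
  • Principal investigator:
  • Amount:
    $272,100
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
    2004
  • Funding country:
    United States
  • Project period:
    2004-05-01 to 2008-04-30
  • Project status:
    Completed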

Project abstract

DESCRIPTION (provided by applicant): The Human Genome Project and related genome projects have stirred great hopes for improving our understanding and treatment of diseases. Central to this process is the automated detection of functional motifs and the classification of protein sequences into families and/or subfamilies. Conventional approaches to protein sequence classification usually employ sequence alignment methods; other methods depend on the choice of features included in the training sets, and on the accuracy and availability of data. We propose an alignment-independent classification approach based on a search engine technology that has been successfully used in classifying medical records. Each protein is represented by a multidimensional vector, the elements of which refer to the protein's most discriminative n-grams (sequences of n amino acids). Preliminary studies on G protein-coupled receptors (GPCRs) showed that a simple Naive Bayes classifier using straightforward n-gram feature selection in its preprocessing can outperform existing classifiers, including support vector machines, on previously investigated, standardized GPCR sequence data subsets. Jackknife tests applied to the Protein Information Resource (PIR) Protein Sequence Database (PSD) and to the Pfam database (DB) of protein families showed that approximately 70% of the protein sequences are classified correctly. More significantly, the most discriminative n-grams in a given protein family appear to have a functional or structural role, as suggested by their comparison with the sequence motifs known to be conserved or active in existing DBs and by the examination of the three-dimensional structure of representative members of the family.
Encouraged by these results, we propose to pursue the following specific aims: (1) develop a new computational tool for protein sequence analysis and protein classification based on n-gram distributions, (2) build a comprehensive DB of protein families based on n-gram distributions and investigate the relationships between this DB and the leading protein classification DBs, (3) determine the functional significance of the top-ranking n-grams, and (4) develop a Java-based toolkit that will provide an easy-to-use, yet flexible, web interface to researchers from various backgrounds. The expected deliverables are the methodology and software for classification without alignment (CWA); a new database of classified proteins, based on CWA; and an on-line server and GUI that will deliver the database and data mining tools to the scientific community in a user-friendly environment.
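
As a rough illustration of the approach the abstract describes (proteins represented by their n-grams, classified with Naive Bayes, evaluated with a jackknife test), the following is a minimal Python sketch and not the applicants' software. The class name `NGramNaiveBayes`, the function names, the choice n = 3, the Laplace smoothing parameter `alpha`, and the toy sequences and family labels are all assumptions made for illustration. The feature-selection step that keeps only the most discriminative n-grams is omitted, so every n-gram seen in training is used.

```python
import math
from collections import Counter, defaultdict


def ngrams(seq, n=3):
    """Return the overlapping n-grams (length-n substrings) of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NGramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts (illustrative sketch only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length (assumed value, not from the abstract)
        self.alpha = alpha  # Laplace smoothing constant (assumption)

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for seq, label in zip(sequences, labels):
            self.gram_counts[label].update(ngrams(seq, self.n))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, seq):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the family
            denom = (sum(self.gram_counts[label].values())
                     + self.alpha * len(self.vocab))
            for g in ngrams(seq, self.n):
                if g in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((self.gram_counts[label][g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def jackknife_accuracy(sequences, labels, n=3):
    """Leave-one-out test: hold out each sequence, train on the rest, predict it."""
    correct = 0
    for i in range(len(sequences)):
        model = NGramNaiveBayes(n).fit(sequences[:i] + sequences[i + 1:],
                                       labels[:i] + labels[i + 1:])
        correct += model.predict(sequences[i]) == labels[i]
    return correct / len(sequences)


# Made-up toy data: two hypothetical families "A" and "B".
toy_sequences = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSWWPLLGG", "GGSWWPILGG"]
toy_labels = ["A", "A", "B", "B"]

model = NGramNaiveBayes(n=3).fit(toy_sequences, toy_labels)
print(model.predict("MKTAYIAKQK"))
print(jackknife_accuracy(toy_sequences, toy_labels))
```

No alignment is computed at any point: each sequence is reduced to a bag of short substrings, which is what makes the classification alignment-independent.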

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Ivet Bahar

Toward a deeper understanding of allostery and allotargeting by computational approaches
  • Grant number:
    10462594
  • Fiscal year:
    2021
  • Amount:
    $272,100
  • Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
  • Grant number:
    10231654
  • Fiscal year:
    2021
  • Amount:
    $272,100
  • Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
  • Grant number:
    10887238
  • Fiscal year:
    2021
  • Amount:
    $272,100
  • Project category:
Toward a deeper understanding of allostery and allotargeting by computational approaches
  • Grant number:
    10612069
  • Fiscal year:
    2021
  • Amount:
    $272,100
  • Project category:
Structure and function of PTH class B GPCR
  • Grant number:
    10657916
  • Fiscal year:
    2018
  • Amount:
    $272,100
  • Project category:
NIDA Center of Excellence OF Computational Drug Abuse Research (CDAR)
  • Grant number:
    8896676
  • Fiscal year:
    2014
  • Amount:
    $272,100
  • Project category:
BD2K Consortium Activities
  • Grant number:
    8932081
  • Fiscal year:
    2014
  • Amount:
    $272,100
  • Project category:
NIDA Center of Excellence OF Computational Drug Abuse Research (CDAR)
  • Grant number:
    8743368
  • Fiscal year:
    2014
  • Amount:
    $272,100
  • Project category:
Center for causal Modeling and discovery of Biomedical Knowledge from Big Data
  • Grant number:
    8935874
  • Fiscal year:
    2014
  • Amount:
    $272,100
  • Project category:
Center for causal Modeling and discovery of Biomedical Knowledge from Big Data
  • Grant number:
    9404096
  • Fiscal year:
    2014
  • Amount:
    $272,100
  • Project category:

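Similar NSFC (National Natural Science Foundation of China) grants

Taxonomic study of the genus Gynura (Asteraceae: Senecioneae)
  • Grant number:
    32370222
  • Approval year:
    2023
  • Amount:
    CNY 500,000
  • Project category:
    General Program
Remote sensing scene classification based on pattern and function principles
  • Grant number:
    42371473
  • Approval year:
    2023
  • Amount:
    CNY 460,000
  • Project category:
    General Program
Machine learning-based classification and evolution patterns of riverbank dunes on the Qinghai-Tibet Plateau
  • Grant number:
    42371008
  • Approval year:
    2023
  • Amount:
    CNY 520,000
  • Project category:
    General Program
Systematics and phylogeny of the world Cimbicidae
  • Grant number:
    32370500
  • Approval year:
    2023
  • Amount:
    CNY 500,000
  • Project category:
    General Program
Morphological diversity and taxonomy of tadpoles in China
  • Grant number:
    32370478
  • Approval year:
    2023
  • Amount:
    CNY 500,000
  • Project category:
    General Program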

Similar overseas grants

AMAUTA HEALTH INFORMATICS RESEARCH AND TRAINING PROGRAM
  • Grant number:
    7249492
  • Fiscal year:
    2004
  • Amount:
    $272,100
  • Project category:
Alignment-independent Classification of Proteins
  • Grant number:
    6890047
  • Fiscal year:
    2004
  • Amount:
    $272,100
  • Project category:
Washington Obstetric-Fetal Pharmacology Research Unit
  • Grant number:
    7695403
  • Fiscal year:
    2004
  • Amount:
    $272,100
  • Project category:
Alignment-independent Classification of Proteins
  • Grant number:
    6777359
  • Fiscal year:
    2004
  • Amount:
    $272,100
  • Project category:
LAYING THE FOUNDATION FOR GENOMIC ENZYMOLOGY
  • Grant number:
    6636387
  • Fiscal year:
    2000
  • Amount:
    $272,100
  • Project category: