Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam

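Basic information

  • Grant number:
    BB/S020381/1
  • Principal investigator:
  • Amount:
    $1.0395 million
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2019
  • Funding country:
    United Kingdom
  • Period:
    2019 to (no data)
  • Project status:
    Completed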

Project abstract

Proteins are biological macromolecules that perform a diverse array of crucial functions, from enzymes (e.g. the entities responsible for fermentation) to transporters (e.g. hemoglobin in the blood) to mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesized as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes. These days, our ability to generate new protein sequences based on modern high-throughput DNA sequencing (HTS) techniques far outstrips our ability to functionally characterise them. Thus, most sequences are annotated computationally: similarities between new sequences and the few experimentally characterised examples are identified and used to infer function (i.e. to annotate). More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single cell eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).
InterPro is a world-leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources. We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will integrate a new algorithm called TreeGrafter into InterProScan (our software package that performs automatic annotations of protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. We will evaluate the performance of an upgraded version of the HMMER software that is widely used to build protein families, including Pfam, to improve future scalability.
Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and by extension, InterPro.

Project outputs

Journal articles (6)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
EMBL's European Bioinformatics Institute (EMBL-EBI) in 2022.
  • DOI:
    10.1093/nar/gkac1098
  • Published:
    2023-01-06
  • Journal:
  • Impact factor:
    14.9
  • Authors:
    Thakur, Matthew;Bateman, Alex;Brooksbank, Cath;Freeberg, Mallory;Harrison, Melissa;Hartley, Matthew;Keane, Thomas;Kleywegt, Gerard;Leach, Andrew;Levchenko, Mariia;Morgan, Sarah;McDonagh, Ellen M.;Orchard, Sandra;Papatheodorou, Irene;Velankar, Sameer;Vizcaino, Juan Antonio;Witham, Rick;Zdrazil, Barbara;McEntyre, Johanna
  • Corresponding author:
    McEntyre, Johanna
The InterPro protein families and domains database: 20 years on.
  • DOI:
    10.1093/nar/gkaa977
  • Published:
    2021-01-08
  • Journal:
  • Impact factor:
    14.9
  • Authors:
    Blum M;Chang HY;Chuguransky S;Grego T;Kandasaamy S;Mitchell A;Nuka G;Paysan-Lafosse T;Qureshi M;Raj S;Richardson L;Salazar GA;Williams L;Bork P;Bridge A;Gough J;Haft DH;Letunic I;Marchler-Bauer A;Mi H;Natale DA;Necci M;Orengo CA;Pandurangan AP;Rivoire C;Sigrist CJA;Sillitoe I;Thanki N;Thomas PD;Tosatto SCE;Wu CH;Bateman A;Finn RD
  • Corresponding author:
    Finn RD
Pfam: The protein families database in 2021.
  • DOI:
    10.1093/nar/gkaa913
  • Published:
    2021-01-08
  • Journal:
  • Impact factor:
    14.9
  • Authors:
    Mistry J;Chuguransky S;Williams L;Qureshi M;Salazar GA;Sonnhammer ELL;Tosatto SCE;Paladin L;Raj S;Richardson LJ;Finn RD;Bateman A
  • Corresponding author:
    Bateman A
The European Bioinformatics Institute (EMBL-EBI) in 2021.
  • DOI:
    10.1093/nar/gkab1127
  • Published:
    2022-01-07
  • Journal:
  • Impact factor:
    14.9
  • Authors:
    Cantelli G;Bateman A;Brooksbank C;Petrov AI;Malik-Sheriff RS;Ide-Smith M;Hermjakob H;Flicek P;Apweiler R;Birney E;McEntyre J
  • Corresponding author:
    McEntyre J
Reciprocal best structure hits: using AlphaFold models to discover distant homologues.
  • DOI:
    10.1093/bioadv/vbac072
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
  • Corresponding author:

Other publications by Alex Bateman

Bioinformatics Advance Access published May 31, 2007
  • DOI:
    10.1007/s10015-009-0735-5
  • Published:
    2007
  • Journal:
  • Impact factor:
    0.9
  • Authors:
    Alex Bateman
  • Corresponding author:
    Alex Bateman
Bioinformatics Applications Note Databases and Ontologies Codex: Exploration of Semantic Changes between Ontology Versions
  • DOI:
  • Published:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Michael Hartung;Anika Groß;E. Rahm;Alex Bateman
  • Corresponding author:
    Alex Bateman

Other grants held by Alex Bateman

Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods
  • Grant number:
    BB/X018660/1
  • Fiscal year:
    2024
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
UKRI/BBSRC-NSF/BIO: Unifying Pfam protein sequence and ECOD structural classifications with structure models
  • Grant number:
    BB/X012492/1
  • Fiscal year:
    2023
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
Rfam: The community resource for RNA families
  • Grant number:
    BB/S020462/1
  • Fiscal year:
    2019
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
RNAcentral, the RNA sequence database
  • Grant number:
    BB/N019199/1
  • Fiscal year:
    2017
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
Rfam: Towards a sustainable resource for understanding the genomic functional ncRNA repertoire
  • Grant number:
    BB/M011690/1
  • Fiscal year:
    2015
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
Keeping pace with protein sequence annotation; consolidating and enhancing Pfam and InterPro's methodologies for functional prediction
  • Grant number:
    BB/L024136/1
  • Fiscal year:
    2014
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
The RNAcentral database of non-coding RNAs
  • Grant number:
    BB/J019232/1
  • Fiscal year:
    2012
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
Embracing new technologies to streamline, improve and sustain InterPro and its contributing databases
  • Grant number:
    BB/F010435/1
  • Fiscal year:
    2008
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant

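Similar grants from the National Natural Science Foundation of China (NSFC)

Data- and knowledge-driven deep feature extraction and constitutive relation modelling for turbulence
  • Grant number:
    12372288
  • Approval year:
    2023
  • Funding amount:
    CNY 530,000
  • Project type:
    General Program
Theory and methods for hybrid physics- and data-driven multimodal visual inspection of complex curved surfaces
  • Grant number:
    52375516
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
Data-driven study of rogue waves in coupled nonlinear Schrödinger systems based on customized PINNs
  • Grant number:
    62305199
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Distributed data-driven evolutionary computation methods with surrogate model fusion and transfer
  • Grant number:
    62376097
  • Approval year:
    2023
  • Funding amount:
    CNY 510,000
  • Project type:
    General Program
Construction of growth-dynamics monitoring and diagnostic models driven by time-series image phenotyping data of tomato
  • Grant number:
    32301692
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund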

Similar overseas grants

Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
  • Grant number:
    10211377
  • Fiscal year:
    2021
  • Funding amount:
    $1.0395 million
  • Project type:
Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
  • Grant number:
    10378686
  • Fiscal year:
    2021
  • Funding amount:
    $1.0395 million
  • Project type:
Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
  • Grant number:
    10641671
  • Fiscal year:
    2021
  • Funding amount:
    $1.0395 million
  • Project type:
Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam
  • Grant number:
    BB/S020039/1
  • Fiscal year:
    2020
  • Funding amount:
    $1.0395 million
  • Project type:
    Research Grant
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
  • Grant number:
    10737854
  • Fiscal year:
    2019
  • Funding amount:
    $1.0395 million
  • Project type: