BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution
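Basic information
- Grant number: BB/Y001117/1
- Principal investigator:
- Amount: $342,100
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2024
- Funding country: United Kingdom
- Start and end dates: 2024 to (no data)
- Project status: Ongoing
- Source:
- Keywords:
Project abstract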
Proteins play a major role in most of the important processes of life, such as the digestion of nutrients, immune response, and cellular regulation. They are composed of long polymers that fold into compact globular forms known as domains. Most proteins have at least two domains and some are composed of dozens. Domains tend to be associated with specific functions, although sometimes an important function will result from combining multiple domains. 3D structure data and models are particularly valuable for detecting the pockets and surface features linked to domain function. Determining the structure and orientations of the constituent domains is important for understanding the overall function of the protein and the dynamic conformational changes linked to it. Until recently, structural data for proteins was very sparse, with <1% of all known proteins experimentally characterised. Whilst structures can be predicted with reasonable accuracy when the structure of a close relative is known, for a significant proportion of proteins such data did not exist. Even for important organisms like humans or wheat, <50% of proteins had structural data accurate enough to understand the structural impacts of changes in the genes coding the proteins.
This situation changed dramatically in 2021 when DeepMind's AlphaFold AI system succeeded in predicting protein structures of comparable quality to experimentally characterised proteins. In August 2022, DeepMind released >214 million protein structures for all known proteins. Whilst recent analyses showed that in some cases AlphaFold models are not accurate enough for detailed studies, largely because the data needed to make the prediction is still too sparse, the AlphaFold data still massively increases the amount of high-quality structural data available for understanding the mechanisms by which proteins function.
Identifying constituent domains in a protein is not trivial. This project will exploit powerful AI technologies to predict domain boundaries more accurately. Preliminary studies are already showing significant improvements. We will apply multiple domain detection algorithms independently developed by two world-renowned protein domain classification teams (ECOD, CATH), both of which have long track records in successfully automating domain detection. Their methods employ complementary strategies that can be combined to give a consensus prediction where agreement in assignments reflects higher confidence levels. Another major challenge will be coping with the scale of the data. Even allowing for a 50% loss due to poor model quality, the data represents a >200-fold increase in the data already classified in these evolutionary resources. An existing domain assignment and classification pipeline (3D-SCAFOLD) built to integrate experimental domain data from two resources (SCOP, CATH) will be re-engineered to incorporate ECOD (which is much more comprehensive than SCOP) and capture the vast predicted data from AlphaFold. This will require new and more efficient workflows that parallelise the processes. Furthermore, the pipeline will be more complex as additional steps will be necessary to determine the model quality and remove poor models. We will also adapt access to the webpages and APIs to allow users to request targeted subsets and perform the more complex queries required by the increase in the scale of the data.
In addition, we expect that many large, more complex multidomain proteins will be very challenging, leading to discrepancies between the results provided by the different resources. We will hold workshops for the teams to agree on consensus assignments.
To cope with the scale of the data, we will initially target proteins in pathogenic organisms, crops essential for food security, and protein families linked to human health and well-being, including enzyme families important for environmental remediation and the production of commercially valuable compounds.
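The consensus idea in the abstract (agreement between independently developed methods signals higher confidence, after poor models have been removed) can be illustrated with a minimal sketch. This is not the ECOD, CATH, or 3D-SCAFOLD implementation: the `Domain` type, the `overlap_fraction` measure, the 0.8 overlap cut-off, and the 0 to 100 quality score with a cut-off of 70 are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """A predicted domain as an inclusive residue range (hypothetical type)."""
    start: int
    end: int


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Shared residues divided by the length of the longer of the two domains."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    longer = max(a.end - a.start + 1, b.end - b.start + 1)
    return shared / longer


def keep_model(quality_score: float, threshold: float = 70.0) -> bool:
    """Keep a model only if its (assumed 0-100) quality score meets the cut-off."""
    return quality_score >= threshold


def consensus_domains(method_a, method_b, min_overlap: float = 0.8):
    """Label each domain as agreed by both methods or reported by only one.

    A domain from method A counts as "consensus" when its best-matching
    domain from method B overlaps it by at least `min_overlap`.
    """
    results = []
    matched_b = set()
    for da in method_a:
        best = max(method_b, key=lambda db: overlap_fraction(da, db), default=None)
        if best is not None and overlap_fraction(da, best) >= min_overlap:
            results.append((da, "consensus"))
            matched_b.add(best)
        else:
            results.append((da, "method_a_only"))
    # Domains from method B that were never matched are reported separately.
    for db in method_b:
        if db not in matched_b:
            results.append((db, "method_b_only"))
    return results


if __name__ == "__main__":
    a = [Domain(1, 100), Domain(101, 250)]
    b = [Domain(3, 98), Domain(101, 180), Domain(181, 250)]
    if keep_model(85.0):
        for domain, label in consensus_domains(a, b):
            print(domain, label)
```

In this toy input, the two methods agree on the first domain but disagree on whether residues 101 to 250 form one domain or two. Disagreements of this kind are what the abstract expects for large multidomain proteins and proposes to resolve in workshops.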
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Christine Orengo
Globalization : Approaches to Diversities
- DOI:
- Publication year: 2012
- Journal:
- Impact factor: 0
- Authors: Benoit H Dessailly;Natalie L Dawson;Kenji Mizuguchi;Christine Orengo;Hector Cuadra-Montiel
- Corresponding author: Hector Cuadra-Montiel
Other grants by Christine Orengo
ProtFunAI: AI based methods for functional annotation of proteins in crop genomes
- Grant number: BB/Y514044/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods PID 7012435
- Grant number: BB/X018563/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
Transforming the Structural Landscape of CATH to Aid Variant Analyses in Human and Agricultural Organisms and their Pathogens
- Grant number: BB/W018802/1
- Fiscal year: 2022
- Funding amount: $342,100
- Project type: Research Grant
Unlocking the chemical potential of plants: Predicting function from DNA sequence for complex enzyme superfamilies
- Grant number: BB/V014722/1
- Fiscal year: 2022
- Funding amount: $342,100
- Project type: Research Grant
CATH-FunVar - Predicting Viral and Human Variants Affecting COVID-19 Susceptibility and Severity and Repurposing Therapeutics
- Grant number: BB/W003368/1
- Fiscal year: 2021
- Funding amount: $342,100
- Project type: Research Grant
3D-Gateway - Gateway to protein structure and function
- Grant number: BB/S020144/1
- Fiscal year: 2020
- Funding amount: $342,100
- Project type: Research Grant
Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam
- Grant number: BB/S020039/1
- Fiscal year: 2020
- Funding amount: $342,100
- Project type: Research Grant
SENSE - Screening of ENvironmental SEquences to discover novel protein functions, using informatics target selection and high-throughput validation
- Grant number: BB/T002735/1
- Fiscal year: 2020
- Funding amount: $342,100
- Project type: Research Grant
BBSRC-NSF/BIO Expanding the fold library in the twilight zone to facilitate structure determination of macromolecular machines
- Grant number: BB/S016007/1
- Fiscal year: 2020
- Funding amount: $342,100
- Project type: Research Grant
Increasing the Coverage and Accuracy of CATH for Comparative Genomics and Variant Interpretation
- Grant number: BB/R014892/1
- Fiscal year: 2018
- Funding amount: $342,100
- Project type: Research Grant
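Similar grants from the National Natural Science Foundation of China (NSFC)
Mechanism by which SYNJ1 protein fragments contribute to the onset of Parkinson's disease by promoting aggregation of the synaptic protein NSF
- Grant number: 82201590
- Approval year: 2022
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Mechanism by which SYNJ1 protein fragments contribute to the onset of Parkinson's disease by promoting aggregation of the synaptic protein NSF
- Grant number:
- Approval year: 2022
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Role and mechanism of GluA2-containing AMPA receptor membrane stability, mediated by S-nitrosylation of the NSF protein, in post-stroke depression
- Grant number: 82071300
- Approval year: 2020
- Funding amount: CNY 550,000
- Project type: General Program
Participation in the China-US (NSFC-NSF) biodiversity project review meeting
- Grant number:
- Approval year: 2019
- Funding amount: CNY 20,000
- Project type: International (Regional) Cooperation and Exchange Program
Participation in the China-US (NSFC-NSF) biodiversity project review meeting
- Grant number: 31981220281
- Approval year: 2019
- Funding amount: CNY 23,000
- Project type: International (Regional) Cooperation and Exchange Program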
Similar overseas grants
BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution
- Grant number: BB/Y000455/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
22-BBSRC/NSF-BIO Building synthetic regulatory units to understand the complexity of mammalian gene expression
- Grant number: BB/Y008898/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
20-BBSRC/NSF-BIO Regulatory control of innate immune response in marine invertebrates
- Grant number: BB/W017865/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
22-BBSRC/NSF-BIO - Interpretable & Noise-robust Machine Learning for Neurophysiology
- Grant number: BB/Y008758/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant
22-BBSRC/NSF-BIO: Community-dependent CRISPR-cas evolution and robust community function
- Grant number: BB/Y008774/1
- Fiscal year: 2024
- Funding amount: $342,100
- Project type: Research Grant