Navigating Chemical Space with Natural Language Processing and Deep Learning
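Basic information
- Grant number: EP/Y004167/1
- Principal investigator:
- Amount: $114,100
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2024
- Funding country: United Kingdom
- Period: 2024 to (no data)
- Project status: Ongoing
- Source:
- Keywords:
Project abstract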
Natural language processing (NLP) lies at the intersection of linguistics and computer science and aims to process and analyse human language, typically provided as written text. NLP now focuses strongly on machine learning for challenging tasks, and several revolutionary algorithms have been developed in the last few years. These algorithms underpin a wide range of real-life applications, such as ChatGPT, virtual assistants and automatic text completion in email. Innovative research ideas often come from integrating techniques and concepts across disciplines. For this discipline-hopping grant, we would like to explore how Transformer models, a ground-breaking deep learning algorithm developed by Google in 2017 that fuels the majority of current cutting-edge research in NLP, can be adapted to solve research challenges in chemistry. Chemical structures are usually three-dimensional. However, they are also often converted into sequences called SMILES strings. SMILES has a simple vocabulary of chemical element and bond symbols and a few grammatical rules governing how the elements are positioned. Owing to this direct analogy to text sequences, SMILES makes it possible to use NLP algorithms to analyse chemical structures in much the same way as they are used to analyse text. For the proposed research, Dr Pang, a chemist, will work with Dr Vulic, an NLP and machine learning expert, to get up to speed with the latest developments in NLP and to examine their further applicability in her domain of expertise. We will explore and utilise a concept that is now pervasive in machine learning and NLP, termed transfer learning, which 1) pretrains large general-purpose models, and 2) fine-tunes (i.e., specialises) those general models for specific tasks and applications where labelled data are expensive to create (as they require expert knowledge and complex annotation protocols) and thus inherently scarce.
Specifically, we will pretrain Transformer models to learn a latent representation of the chemical space defined by tens of millions of SMILES strings. During fine-tuning, this learned latent representation can then be used to predict molecular properties for a given chemical structure. The advantage of this approach is that the resulting machine learning models rely less on so-called labelled data (molecules with experimentally determined properties), which are time-consuming or even impossible to generate in chemistry given the associated cost and experimental challenges. We will aim to make the Transformer models more computationally efficient and accurate using two recent machine learning techniques, termed sentence encoding and contrastive learning. We hope that this new molecular representation can complement existing molecular representation methods and provide an alternative way to relate molecular structures to their properties, a task that underpins many research and development activities in the chemical and pharmaceutical industries.
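The analogy between SMILES and text described in the abstract can be made concrete with a small tokenizer. The sketch below is illustrative only and is not the project's method: the regular expression is a simplified assumption that covers common SMILES symbols rather than the full grammar, and `tokenize_smiles`, `build_vocab` and `encode` are hypothetical helper names.

```python
import re

# Simplified, assumed token pattern: bracket atoms first, then the two-letter
# halogens, two-digit ring closures, single letters, digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.:@*~$]")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, much as a sentence is split into words."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reject input containing characters the simplified pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer id to every token seen in the corpus."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in corpus:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles: str, vocab: dict[str, int]) -> list[int]:
    """Map a SMILES string to token ids, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize_smiles(smiles)]


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)O"))  # acetic acid
    vocab = build_vocab(["CCO", "CC(=O)O"])
    print(encode("CCN", vocab))  # "N" is not in this toy vocabulary
```

The resulting id sequences are the kind of input a sequence model consumes; a real pipeline would more likely rely on an established cheminformatics toolkit and tokenizer than on this pattern.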
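The abstract names contrastive learning without specifying an objective. As one possible reading, the sketch below implements the widely used InfoNCE loss on toy embedding vectors. Everything here is an assumption for illustration: the function names, the temperature value, and the pairing convention (each anchor's positive sits at the same index, and the other positives in the batch act as negatives). The vectors are placeholders, not outputs of any trained model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors: list[list[float]],
             positives: list[list[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch.

    positives[i] is treated as the matching view of anchors[i];
    every other entry of positives serves as a negative for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_sum - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    views_a = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]     # positives aligned with anchors
    mismatched = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped
    print(info_nce(views_a, matched), info_nce(views_a, mismatched))
```

The loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, which is the signal that pulls embeddings of matching views together during training.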
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Jiayun Pang
Mutagenesis of morphinone reductase induces multiple reactive configurations and identifies potential ambiguity in kinetic analysis of enzyme tunneling mechanisms.
- DOI:
- Published: 2007
- Journal:
- Impact factor: 15
- Authors: C. Pudney; Sam Hay; Jiayun Pang; C. Costello; D. Leys; M. Sutcliffe; N. Scrutton
- Corresponding author: N. Scrutton
New insights into the multi-step reaction pathway of the reductive half-reaction catalysed by aromatic amine dehydrogenase: a QM/MM study.
- DOI: 10.1039/c003107k
- Published: 2010-04-27
- Journal:
- Impact factor: 4.9
- Authors: Jiayun Pang; Nigel S. Scrutton; Sam P de Visser; M. Sutcliffe
- Corresponding author: M. Sutcliffe
Atomistic insight into the origin of the temperature-dependence of kinetic isotope effects and H-tunnelling in enzyme systems is revealed through combined experimental studies and biomolecular simulation.
- DOI:
- Published: 2008
- Journal:
- Impact factor: 3.9
- Authors: Sam Hay; C. Pudney; P. Hothi; L. Johannissen; Laura Masgrau; Jiayun Pang; D. Leys; M. Sutcliffe; N. Scrutton
- Corresponding author: N. Scrutton
Using natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters
- DOI: 10.1039/d3dd00119a
- Published: 2023-11
- Journal:
- Impact factor: 0
- Authors: Jiayun Pang; Alexander W. R. Pine; Abdulai Sulemana
- Corresponding author: Abdulai Sulemana
Protein motions during catalysis by dihydrofolate reductases
- DOI:
- Published: 2006
- Journal:
- Impact factor: 0
- Authors: R. Allemann; R. Evans; Lai; G. Maglia; Jiayun Pang; Robert J Rodriguez; P. Shrimpton; Richard S. Swanwick
- Corresponding author: Richard S. Swanwick
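Similar grants from the National Natural Science Foundation of China
Chemical construction of spatially confined selenide heterostructures and external-field-enhanced lithium-sulfur batteries
- Grant number: 22361035
- Year approved: 2023
- Funding amount: CNY 320,000
- Project category: Regional Science Fund Project
Corrosion mechanism of printed circuit boards under the interaction of Aspergillus niger and electrochemistry in the space environment
- Grant number: 52371048
- Year approved: 2023
- Funding amount: CNY 500,000
- Project category: General Program
Spatial distribution of coral microstructure and chemical composition and its constraints on the biocalcification mechanism
- Grant number: 42273012
- Year approved: 2022
- Funding amount: CNY 580,000
- Project category: General Program
Using chemical labelling of sugar metabolism to study the biological effects of space radiation and microgravity on neuronal sialylation
- Grant number:
- Year approved: 2022
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
A new method for in situ electrochemical studies with high spatial resolution: combined electrochemistry and amplitude-modulation electrostatic force microscopy
- Grant number:
- Year approved: 2022
- Funding amount: CNY 540,000
- Project category: General Program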
Similar overseas grants
Development and trial of an immersive visualization space for learning craft expression
- Grant number: 24K06316
- Fiscal year: 2024
- Funding amount: $114,100
- Project category: Grant-in-Aid for Scientific Research (C)
Collaborative Research: IIBR: Innovation: Bioinformatics: Linking Chemical and Biological Space: Deep Learning and Experimentation for Property-Controlled Molecule Generation
- Grant number: 2318829
- Fiscal year: 2023
- Funding amount: $114,100
- Project category: Continuing Grant
Interleaved 1H/23Na imaging for invasive and proliferative phenotypes of brain tumors
- Grant number: 10634269
- Fiscal year: 2023
- Funding amount: $114,100
- Project category:
Chemical Biology Approaches to Studying Collagen IV Stability
- Grant number: 10723042
- Fiscal year: 2023
- Funding amount: $114,100
- Project category:
Cytosolic DNA sensing instructs resident macrophage vitality and organismal longevity
- Grant number: 10901044
- Fiscal year: 2023
- Funding amount: $114,100
- Project category: