Tuning Large language models to read biological literature
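Basic information
- Grant number: BB/Y514032/1
- Principal investigator:
- Amount: $237,800
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2024
- Funding country: United Kingdom
- Project period: 2024 to (no data)
- Project status: Ongoing
- Source:
- Keywords: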
Project abstract
In this application, we focus on two related bioinformatics challenges that require interpretation and knowledge extraction from biological and biomedical literature at scale.

First, gene/genome databases store information on gene function, which is ultimately derived from scientific experiments whose results are reported in publications. It is exceptionally time-consuming and expensive for human curators to read all relevant scientific literature, interpret what has been reported about the function or localisation of gene products, and assign specific controlled vocabulary terms (e.g. Gene Ontology terms) or short free-text descriptions (gene names or product descriptions).

Second, enormous volumes of raw data sets accompany scientific publications. These are deposited in archival databases from expensive omics experiments, including mass spectrometry (MS) proteomics. Our group and others develop and apply pipelines for re-analysing MS data for new purposes, including annotating genomes, discovering post-translational modifications and building quantitative atlases of species or tissues, amongst others. There is a major bottleneck in interpreting the original experimental design, sample descriptions and software parameters, which are currently described in blocks of free text submitted to the archival repository or within the Materials and Methods sections of accompanying articles.

For both challenges, we believe that, given the recent extraordinary improvements in large language models (LLMs), these models can be retrained and harnessed to remove the bottleneck in knowledge extraction from literature. Our group has significant expertise in bioinformatics and machine learning, but limited expertise in natural language processing (NLP) to date. In this international partnering application, we are collaborating with a leading group in artificial intelligence and NLP from the University of Pennsylvania (UPenn).

The UPenn team will help to guide us towards the optimal approach for re-training open source LLMs, using training data that our team has amassed over many years. We will produce open source code for the two challenge areas, with a longer-term plan to put it into production within major international databases and consortia in which we have leading roles.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Antony McCabe
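Similar National Natural Science Foundation of China (NSFC) grants
Radical excitation and induction mechanisms of macrocyclic supramolecules acting on organic pollutants and their degradation intermediates
- Grant number: 52370168
- Year approved: 2023
- Amount: CNY 500,000
- Project type: General Program
Predicting individual ERP waveforms from the variability of large-scale time-varying fMRI networks
- Grant number: 82372084
- Year approved: 2023
- Amount: CNY 480,000
- Project type: General Program
Formation mechanisms and regional effects of cross-boundary cooperation networks among development zones: the case of three major urban agglomerations
- Grant number: 42301183
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund
Mechanism by which early intervention with Didang Decoction (Didang Tang) inhibits adventitial vasa vasorum neovascularisation, reduces vascular calcification and delays the onset of macrovascular disease in type 2 diabetes
- Grant number: 82374247
- Year approved: 2023
- Amount: CNY 490,000
- Project type: General Program
Constructing large-gap two-dimensional topological insulators using the substrate orbital filtering effect
- Grant number: 12304199
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund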
Similar overseas grants
Collaborative Research: Conference: Large Language Models for Biological Discoveries (LLMs4Bio)
- Grant number: 2411529
- Fiscal year: 2024
- Amount: $237,800
- Project type: Standard Grant
Collaborative Research: Conference: Large Language Models for Biological Discoveries (LLMs4Bio)
- Grant number: 2411530
- Fiscal year: 2024
- Amount: $237,800
- Project type: Standard Grant
Investigating the potential for developing self-regulation in foreign language learners through the use of computer-based large language models and machine learning
- Grant number: 24K04111
- Fiscal year: 2024
- Amount: $237,800
- Project type: Grant-in-Aid for Scientific Research (C)
SMILE - Semantic Modelling of Intent through Large-language Evaluations
- Grant number: 10097766
- Fiscal year: 2024
- Amount: $237,800
- Project type: Collaborative R&D
Multi-agent Self-improving of Large Language Models (LLMs)
- Grant number: 2903811
- Fiscal year: 2024
- Amount: $237,800
- Project type: Studentship