III: Small: Accessible and Interpretable Machine Reading Methods for Extracting Structured Information from Text
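Basic information
- Award number: 2006583
- Principal investigator:
- Amount: $499,900
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-07-15 to 2024-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract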
Computers, the Internet, and cheap storage promote the acquisition and collection of vast quantities of data. There is a seemingly infinite supply of text documents that contain critical scientific, socio-political, and business insights, far more than can be read by a human. Within the natural language processing (NLP) domain, the field of information extraction (IE) targets exactly this problem, but it requires its practitioners to have expertise in linguistics, machine learning, or both. Consequently, the majority of the advancements in the field of IE are difficult to access for domain experts such as epidemiologists, biologists, and economists. This project will empower these domain experts to develop and deploy IE systems targeting their own particular needs without requiring expertise in NLP, linguistics, or machine learning. This, in turn, will dramatically impact the process, pace, and productivity of conducting critical scientific research and collaboration, as experts could have far more ready access to the knowledge most essential to them and their research (both in their domain and adjacent domains). The products of this work will be shared across the scientific community through a series of outreach efforts such as video courses, publications, and a workshop at a high-visibility conference. To broaden participation, outreach activities (including deepening collaborations with institutional colleagues and local community outreach) will be done with an emphasis on groups who are historically underrepresented in academia.
The planned work will be accomplished through a human-technology partnership, where domain experts specify their information need at the level they find intuitive (e.g., phosphorylation acts on proteins). The system will then extend techniques from the adjacent field of program synthesis to convert these high-level, abstract specifications into low-level grammars (i.e., sets of hierarchical information extraction rules) which can be executed in order to extract the desired information from text. Crucially, the specification requires no linguistic knowledge, making it accessible to a broader population. The need for domain-specific entities (e.g., names of proteins) will be addressed through an entity discovery procedure that incorporates techniques for detecting multi-word entity candidates and inferring their semantic types (e.g., PROTEIN). To ensure that the product of the system is readily interpretable and easily extensible, a series of user studies will be conducted to discover the key characteristics of rules and grammars that affect their interpretability and maintainability. Through this combined effort, several datasets and software products will be produced and made available to the wider community. These include (but are not limited to) (a) a dataset of event specifications and the corresponding automatically synthesized rules for several domains, (b) a dataset of human judgements of grammar interpretability, and (c) models which can serve as automatic proxies for the more expensive human evaluation of interpretability. All data will be anonymized and released under the Open Data Commons Public Domain Dedication & License, which allows users to freely share, modify, and use this data, in the hope that this effort will be exploited further. To ensure as wide an audience as possible, the software and techniques developed in this work, including the rule synthesis framework, a pipeline for entity discovery, and any generated user interfaces, will be released as open-source software products (under an Apache 2.0 open source license).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Noriega-Atala, Enrique; Vacareanu, Robert; Hahn-Powell, Gus; Valenzuela-Escárcega, Marco A.
- Corresponding author: Valenzuela-Escárcega, Marco A.
From Examples to Rules: Neural Guided Rule Synthesis for Information Extraction
- DOI:
- Published: 2022-01
- Journal:
- Impact factor: 0
- Authors: Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu
- Corresponding author: Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu
Bootstrapping Neural Relation and Explanation Classifiers
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Zheng, Tang; Surdeanu, Mihai
- Corresponding author: Surdeanu, Mihai
Do Transformer Networks Improve the Discovery of Inference Rules from Text?
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Rahimi, Mahdi; Surdeanu, Mihai
- Corresponding author: Surdeanu, Mihai
Syntax-driven Data Augmentation for Named Entity Recognition
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Sutiono, Arie; Hahn-Powell, Gus
- Corresponding author: Hahn-Powell, Gus
Other publications by Mihai Surdeanu
Information Extraction from Legal Wills: How Well Does GPT-4 Do?
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: A. Kwak; Cheonkam Jeong; Gaetano Forte; Derek E. Bambauer; Clayton T. Morrison; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
On Learning Bipolar Gradual Argumentation Semantics with Neural Networks
- DOI: 10.5220/0012448300003636
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Caren Al Anaissy; Sandeep Suntwal; Mihai Surdeanu; Srdjan Vesic
- Corresponding author: Srdjan Vesic
Retrieval Augmented Generation of Subjective Explanations for Socioeconomic Scenarios
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Razvan; Maria Alexeeva; K. Alcock; Nargiza Ludgate; Cheonkam Jeong; Zara Fatima Abdurahaman; Prateek Puri; Brian Kirchhoff; Santadarshan Sadhu; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Razvan; Vikas Yadav; Rishabh Maheshwary; Paul; Sathwik Tejaswi Madhusudhan; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
Similar National Natural Science Foundation of China (NSFC) grants
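Mechanism by which huperzine A mediates microglial polarization phenotypes against ischemic stroke, studied at single-cell resolution
- Award number: 82304883
- Year approved: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund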
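Role and mechanism of small cysteine-free proteins in regulating the insecticidal activity of biocontrol fungi
- Award number: 32372613
- Year approved: 2023
- Funding amount: CNY 500,000
- Project type: General Program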
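Role and mechanism of the theranostic agent PS-Hc@MB combined with training in mediating rehabilitation from cerebral small vessel disease
- Award number: 82372561
- Year approved: 2023
- Funding amount: CNY 490,000
- Project type: General Program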
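Mechanism by which the MECOM/HBB pathway mediates abnormal heme metabolism and inhibits ferroptosis of tumor-initiating cells in non-small cell lung cancer
- Award number: 82373082
- Year approved: 2023
- Funding amount: CNY 490,000
- Project type: General Program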
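Role and mechanism of FATP2/HILPDA/SLC7A11 axis-mediated lipid metabolic reprogramming of tumor-associated neutrophils in affecting radiotherapy immunity in non-small cell lung cancer
- Award number: 82373304
- Year approved: 2023
- Funding amount: CNY 490,000
- Project type: General Program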
Similar overseas grants
Collaborative Research: HCC: Small: End-User Guided Search and Optimization for Accessible Product Customization and Design
- Award number: 2327136
- Fiscal year: 2023
- Funding amount: $499,900
- Project type: Standard Grant
LIMBER UniLeg: A digital and additive manufacturing approach for accessible prosthetic care.
- Award number: 10761671
- Fiscal year: 2023
- Funding amount: $499,900
- Project type:
Simple and Accessible Microfluidic Platform for Single Molecule Sequence Profiling of Tumor-derived DNA within Liquid Biopsies
- Award number: 10699214
- Fiscal year: 2023
- Funding amount: $499,900
- Project type:
Novel, On-demand VR for Accessible, Practical, and Engaging therapy (NO VAPE)
- Award number: 10740956
- Fiscal year: 2023
- Funding amount: $499,900
- Project type:
Collaborative Research: HCC: Small: End-User Guided Search and Optimization for Accessible Product Customization and Design
- Award number: 2327137
- Fiscal year: 2023
- Funding amount: $499,900
- Project type: Standard Grant