RI: Medium: Broad-Coverage Semantic Parsing: Linguistic Representation Learning from Crowd-Scale Data

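Basic Information

Award number:
1562364
Principal investigator:
Noah Smith
Amount:
$1.006 million
Host institution:
University of Washington
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2016
Funding country:
United States
Start and end dates:
2016-09-01 to 2021-08-31
Project status:
Completed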

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1562364&HistoricalAwards=false
Keywords:
RI Medium Broad Coverage Semantic

Project Abstract

Automated understanding of text is a capability that will advance a wide range of language technologies, including information extraction, question answering, opinion analysis, and translation between languages. Such technologies have been in demand in the intelligence and defense communities for many years, and they now underlie many commercially available information-management tools. This project develops robust algorithms that understand natural language expressions by mapping them to formal representations of their meaning, a technique known as semantic parsing. For semantic parsing to be employed in technologies like those listed above, it needs to overcome the fundamental challenge of broad coverage: the ability to handle any text input, in multiple languages. This project meets this challenge by creating new methods for gathering large repositories of semantically annotated data at greatly reduced cost; these are then used to train much more accurate broad-coverage parsing models. The results of this project include open-source implementations, high-quality annotated corpora on an unprecedented scale, and reusable distributed semantic representations for use by the community of natural language processing researchers and practitioners.

The goal of broad-coverage semantic parsing can only be achieved by simultaneously focusing on new, large-scale sources of data with semantically meaningful annotations and new learning algorithms for inducing models with the representational capacity to make full use of such data. For scalable data collection, this project introduces new techniques that rely on two key complementary insights: (1) any reader who understands a text can answer questions about it, and (2) questions can be constructed whose answers probe any aspect of semantics that needs to be recovered. These observations allow designing new data collection techniques that reduce the burden of semantic annotation by providing simple questions and answers about texts. This QA-style annotation can be done for any text in any language, given only native speakers, bypassing the significant effort that currently goes into defining detailed annotation standards. It also allows gathering new datasets on a much larger scale, and for more diverse text types, than ever before. In addition, the project develops new representation learning techniques that tie together a wide range of semantic annotation styles, including the new crowdsourced ones, in a multitask learning setup. Continuous representations (e.g., of word types) provide a powerful way to share statistical strength across a large vocabulary, many of whose elements are sparsely observed. While past work has emphasized learning word embeddings, this project employs a shared continuous space ("framespace") that can capture abstract frames and roles used in predicate-argument (and logical) semantics. The usefulness of these representations depends on the tasks they are trained to perform, and using multiple related tasks can lead to benefits on all of them, by sharing statistical strength across task-specific representations, across elements of the semantic lexicon, and even across languages.

Project Outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Noah Smith

THE NORTH ATLANTIC TREATY ORGANIZATION AND UNITED STATES RELATIONSHIP: A STUDY OF ITS DEVELOPMENT AND POSSIBLE FUTURE

DOI:
Publication year:
2015
Journal:
Impact factor:
0
Authors:
Noah Smith
Corresponding author:
Noah Smith

Buying health: assessing the impact of a consumer-side vegetable subsidy on purchasing, consumption and waste

DOI:
Publication year:
2015
Journal:
Public Health Nutrition
Impact factor:
3.2
Authors:
Noah Smith
Corresponding author:
Noah Smith

Implications for cumulative and prolonged clinical improvement induced by cross-linked hyaluronic acid: An in vivo biochemical/microscopic study in humans.

DOI:
10.1111/exd.14998
Publication year:
2024
Journal:
Experimental Dermatology
Impact factor:
3.6
Authors:
Frank Wang;T. Do;Noah Smith;J. Orringer;Sewon Kang;John J Voorhees;Gary J. Fisher
Corresponding author:
Gary J. Fisher