CAREER: Multilingual Learning for Event Structures from Text
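Basic Information
- Award number: 2239570
- Principal investigator:
- Amount: $582,200
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2023
- Funding country: United States
- Period: 2023-06-01 to 2028-05-31
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract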
Natural language text is replete with important events in different areas (protests, cybersecurity breaches, elections, disease outbreaks, and business transactions). Identifying events that describe who did what to whom, together with the relations among them (causal, subevent, and coreferential), from large amounts of text can provide valuable data to support intelligent applications and data-driven decisions across various domains. However, current event structure extraction systems can only process text in a few widely used languages such as English, Chinese, Spanish, and Arabic. Text in many other languages therefore cannot be processed by current event extraction systems. This limitation has restricted the coverage of data sources for these systems, introduced language biases in the extracted events, and delayed updates with the latest events in local reports. As a result, event data collected with current techniques cannot comprehensively represent the latest developments around the world or effectively support decision making on important problems of national interest. To address these multilingual challenges, this project will develop event extraction and event-event relation extraction systems that are effective for data in multiple languages, with an emphasis on understudied and low-resource languages, to improve the coverage of extracted data and promote the democratization of technologies. In information retrieval, multilingual event structure data from the developed technologies can enable data management systems to quickly obtain answers and create summaries for broader user queries in many more languages. In cybersecurity, databases of cyber attack events extracted from multilingual sources can be used to generate more fine-grained and comprehensive reports that inform resource allocation decisions to better protect online activities. In socio-political science, coded conflict and mediation events from more languages can increase the scope and reduce the biases of the data to support better decisions for foreign policy, civil war prevention, environmental challenges, or economic strategies.
This project will address three fundamental limitations of existing multilingual learning research for event structure extraction: (i) the lack of multilingual datasets that provide annotation for multiple languages to sufficiently support evaluation of model generalization across different language families, (ii) the limitations of current multilingual representation learning methods when aligning representations between languages to induce language-general features, and (iii) the scarcity of labeled data in different languages to train multilingual models. First, the project will annotate documents for all event extraction and event-event relation extraction tasks in many more languages using consistent schemas. The languages selected for annotation will be typologically diverse, understudied, and low-resource to provide reliable multilingual evaluation data for the developed methods. Second, to boost cross-lingual performance for event structure extraction, this project will devise multilingual representation learning methods that enable effective knowledge transfer, so that models trained on labeled data in high-resource languages can be directly applied to data in other languages. The project will develop novel representation alignment methods for different languages using representation matching, augmentation, and language-general structure induction for text. Third, to address the limited training data for multilingual learning, this project will develop novel methods to automatically generate labeled data in different languages. The project will introduce techniques to mitigate noise in the generated data and optimize generation procedures to boost multilingual learning and performance.
The research activities in this project will be closely integrated with education and outreach missions to broaden their impacts. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project Outcomes
Journal articles (4)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Generating Labeled Data for Relation Extraction: A Meta Learning Approach with Joint GPT-2 Training
- DOI: 10.18653/v1/2023.findings-acl.727
- Publication date: 2024-09-13
- Journal:
- Impact factor: 0
- Authors: Amir Pouran Ben Veyseh; Franck Dernoncourt; Bonan Min; Thien Huu Nguyen
- Corresponding author: Thien Huu Nguyen
Retrieving Relevant Context to Align Representations for Cross-lingual Event Detection
- DOI: 10.18653/v1/2023.findings-acl.135
- Publication date: 2024-09-13
- Journal:
- Impact factor: 2.7
- Authors: Chien Nguyen; Linh Van Ngo; Thien Huu Nguyen
- Corresponding author: Thien Huu Nguyen
Hybrid Knowledge Transfer for Improved Cross-Lingual Event Detection via Hierarchical Sample Selection
- DOI: 10.18653/v1/2023.acl-long.296
- Publication date: 2024-09-13
- Journal:
- Impact factor: 4.1
- Authors: Luis Guzman Nateras; Franck Dernoncourt; Thien Huu Nguyen
- Corresponding author: Thien Huu Nguyen
Contextualized Soft Prompts for Extraction of Event Arguments
- DOI: 10.18653/v1/2023.findings-acl.266
- Publication date: 2024-09-13
- Journal:
- Impact factor: 0
- Authors: Chien Van Nguyen; Hieu Man; Thien Huu Nguyen
- Corresponding author: Thien Huu Nguyen
Other publications by Thien Nguyen
Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation
- DOI:
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Thien Nguyen; Nathalie Tran; Liuhui Deng; T. F. D. Silva; Matthew Radzihovsky; Roger Hsiao; Henry Mason; Stefan Braun; E. McDermott; Dogan Can; P. Swietojanski; Lyan Verwimp; Sibel Oyman; Tresi Arvizo; Honza Silovsky; Arnab Ghoshal; M. Martel; Bharat Ram Ambati; Mohamed Ali
- Corresponding author: Mohamed Ali
An Efficient Hybrid Model for Vietnamese Sentiment Analysis
- DOI: 10.1007/978-3-319-54472-4_22
- Publication date: 2017-04-03
- Journal:
- Impact factor: 0
- Authors: Thanh Hung Vo; Thien Nguyen; Hoang; T. Le
- Corresponding author: T. Le
A Systematic Review of Radiosurgery Versus Surgery for Neurofibromatosis Type 2 Vestibular Schwannomas.
- DOI: 10.1016/j.wneu.2017.08.159
- Publication date: 2018
- Journal:
- Impact factor: 2
- Authors: Lawrance K. Chung; Thien Nguyen; J. Sheppard; C. Lagman; S. Tenn; Percy Lee; T. Kaprealian; R. Chin; Quinton S. Gopen; I. Yang
- Corresponding author: I. Yang
Time-Resolved Velocity Measurements in a Matched Refractive Index Facility of Randomly Packed Spheres
- DOI: 10.1115/icone26-82425
- Publication date: 2018-07-22
- Journal:
- Impact factor: 0
- Authors: E. Kappes; M. Marciniak; A. Mills; R. Muyshondt; S. King; Thien Nguyen; Y. Hassan; V. Ugaz
- Corresponding author: V. Ugaz
Addressing the Challenges in the Placement of Seafloor Infrastructure on the East Breaks Slide-A Case Study: The Falcon Field (EB 579/623), Northwestern Gulf of Mexico
- DOI: 10.4043/16748-ms
- Publication date: 2004
- Journal:
- Impact factor: 0
- Authors: J. S. Hoffman; Michael J. Kaluza; R. Griffiths; Gary McCullough; J. Hall; Thien Nguyen
- Corresponding author: Thien Nguyen
Other grants held by Thien Nguyen
Phase I IUCRC University of Oregon: Center for Big Learning
- Award number: 1747798
- Fiscal year: 2018
- Funding amount: $582,200
- Project type: Continuing Grant
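Similar NSFC (National Natural Science Foundation of China) grants
Research on Online Public Opinion Monitoring Technology Combined with Content Auditing in Multilingual and Multi-script Environments
- Award number: 61163052
- Approval year: 2011
- Funding amount: CNY 400,000
- Project type: Regional Science Fund Project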
Similar overseas grants
Expanding Access to Care for Marginalized Caregivers through Innovative Methods for Multicultural and Multilingual Adaptation of AI-Based Health Technologies
- Award number: 10741177
- Fiscal year: 2023
- Funding amount: $582,200
- Project type:
Elementary Teacher Professional Learning of Equitable Engineering Pedagogies for Multilingual Students
- Award number: 2300766
- Fiscal year: 2023
- Funding amount: $582,200
- Project type: Standard Grant
Language Identity and Mental Health Disparities among Multilingual 1.5 Generation Asian/Asian American Immigrant Young Adults: A Mixed Methods Study
- Award number: 10715803
- Fiscal year: 2023
- Funding amount: $582,200
- Project type:
The role and effects of Bilingual Learning Assistants in supporting multilingual learners in schools
- Award number: 2737845
- Fiscal year: 2022
- Funding amount: $582,200
- Project type: Studentship
Enabling Deep Learning for Multilingual Sociopragmatics
- Award number: RGPIN-2018-04267
- Fiscal year: 2022
- Funding amount: $582,200
- Project type: Discovery Grants Program - Individual