Novel statistical models for text mining with applications to Chinese history and texts

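Basic information

  • Award number:
    1208771
  • Principal investigator:
  • Amount:
    $400,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2012
  • Funding country:
    United States
  • Period:
    2012-07-01 to 2016-06-30
  • Status:
    Completed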

Project abstract

In this project, the investigators study a series of challenging problems in extracting information from Chinese text: (1) word/phrase discovery, (2) text segmentation, (3) technical term recognition, and (4) association discovery among technical terms. Unlike alphabetic languages such as English, Chinese has several special properties: no word boundaries, no clear definition of a word, traditionally no punctuation, and a distinctive grammar. Applying methods developed for alphabetic languages directly to Chinese is therefore problematic, and the methods available in the literature for analyzing Chinese text have many limitations. The investigators instead propose an advanced word dictionary model (AWDM) that simultaneously achieves word discovery, text segmentation, and technical term recognition, tasks that are traditionally studied separately. The idea is to first build a word dictionary by enumerating all word candidates in the texts that satisfy a certain criterion, and to assign each candidate a word usage frequency and a latent word-type label representing either a type of technical term (such as names, addresses, office titles, and time labels) or background text. A Markov dependence model among words and word types then captures the potential grammatical and semantic structure of the texts. With the help of training data (i.e., lists of known technical terms), the AWDM can automatically select the most meaningful words from the huge space of word candidates, determine the type of each word from both its content and its surrounding context, and segment the texts using both grammatical and semantic information.
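As a rough illustration of the dictionary-then-segment idea described above, the sketch below enumerates word candidates by a simple frequency criterion and then finds the most probable segmentation under a plain unigram dictionary model using dynamic programming. It is not the AWDM: it has no latent word-type labels, no Markov dependence, and no training data. The function names, the `max_len` and `min_count` parameters, and the frequency criterion are all assumptions made for this example.

```python
import math
from collections import Counter


def enumerate_candidates(texts, max_len=4, min_count=2):
    """Count every substring of up to max_len characters.

    Hypothetical criterion: keep multi-character candidates that occur at
    least min_count times; always keep single characters so that any text
    built from seen characters can be segmented.
    """
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    return {w: c for w, c in counts.items() if len(w) == 1 or c >= min_count}


def segment(text, freq):
    """Return the most probable segmentation under a unigram dictionary model.

    freq maps each dictionary word to its usage count. Unseen single
    characters get a floor count of 1 (an assumption of this sketch) so that
    a segmentation always exists.
    """
    total = sum(freq.values()) or 1
    max_len = max((len(w) for w in freq), default=1)
    best = [0.0] + [-math.inf] * len(text)  # best log-probability up to each position
    back = [0] * (len(text) + 1)            # start index of the last word
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = freq.get(word, 1 if len(word) == 1 else 0)
            if count == 0:
                continue
            score = best[start] + math.log(count / total)
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]
```

A typical call would be `segment(sentence, enumerate_candidates(corpus))`, where `corpus` is a list of unsegmented strings. In the model described in the abstract, the unigram score would be replaced by one that also depends on word types and on neighboring words.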
Compared to existing methods in the literature, the AWDM is more efficient because it models grammatical and semantic information jointly and integrates word discovery, text segmentation, and technical term recognition into a single analysis. Combined with other text mining tools, such as topic models and theme dictionary models, the proposed method will lead to a powerful multi-level (Chinese character level, word/phrase level, theme level, topic level) analysis platform for Chinese texts.
With the explosive growth of the internet and digital technologies, large quantities of digitized Chinese text can be collected easily. For example, many Chinese historical documents written in traditional Chinese are now available in digital form, and public media such as newspapers, forums, blogs, and microblogs produce huge amounts of Chinese text every day. There is thus great appeal in developing text mining tools that automatically extract information from these data and create new knowledge. The ideas and approaches in this project may have a significant impact on how Chinese history is studied. An efficient and reliable method for extracting information from the ever-growing databases of digitized historical documents will enable researchers to analyze change over time based on large numbers of disaggregated data points, something that was impractical in the past. Furthermore, although originally designed for Chinese, these approaches have the potential to be applied to other Asian languages similar to Chinese, such as Japanese and Korean, and thus to provide a powerful multi-language platform for the study of Asian history. In addition, the novel way of addressing the challenges of named entity recognition studied in this project has the potential to be extended to alphabetic languages such as English.
Finally, the ideas and approaches studied in this project have the potential to be generalized into a systematic tool that digests any data flow of Chinese text and outputs a structured database containing key information about the individuals and organizations described in the input, making it easier for researchers to discover social networks among all kinds of "units" in social life. The item association patterns discovered by the algorithms are also valuable to the study of public media and sociology, and may help reveal important new epidemiological events and societal trends in a timely fashion. These types of information can have important implications for business decision making and governmental policy making.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Jun Liu

Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
  • DOI:
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jinhao Li; Shiyao Li; Jiaming Xu; Shan Huang; Yaoxiu Lian; Jun Liu; Yu Wang; Guohao Dai
  • Corresponding author:
    Guohao Dai
Atrial fibrillation: rhythm control offers no advantage over rate control for some, but not all.
  • DOI:
  • Published:
    2007
  • Journal:
  • Impact factor:
    4.7
  • Authors:
    Yan Bo Li; C. Hu; Jun Liu; Yuan Xiu Chen; Zhe Qu; Jia Xu; Jiang; Jun Wan; Qi; Congxin Huang
  • Corresponding author:
    Congxin Huang
Independent Relationship of Lipoprotein(a) and Carotid Atherosclerosis With Long-Term Risk of Cardiovascular Disease.
  • DOI:
    10.1161/jaha.123.033488
  • Published:
    2024
  • Journal:
  • Impact factor:
    5.4
  • Authors:
    Y. Qi; Youling Duan; Q. Deng; Na Yang; Jia; Jiangtao Li; Piaopiao Hu; Jun Liu; Jing Liu
  • Corresponding author:
    Jing Liu
Bridge risk assessment using a hybrid AHP/DEA methodology - art. no. 1493
  • DOI:
    10.2991/iske.2007.266
  • Published:
    2007
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yun Wang; Jun Liu; Tms Elhag; L. M. López
  • Corresponding author:
    L. M. López
Studies on the Hot Forming and Cold-Die Quenching of AA6082 Tailor Welded Blanks
  • DOI:
  • Published:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jun Liu; Ailin Wang; Haoxiang Gao; Omer El Fakir; X. Luan; Li Liang Wang; Jianguo Lin
  • Corresponding author:
    Jianguo Lin

Other grants by Jun Liu

REU Site: Molecular Biology and Genetics of Cell Signaling
  • Award number:
    2349577
  • Fiscal year:
    2024
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
SCC-PG: Building a smart and connected rural community for improved healthcare access through the deployment of integrated mobility solutions
  • Award number:
    2303284
  • Fiscal year:
    2023
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
Collaborative Research: Bayesian and Semi-Bayesian Methods for Detecting Relationships in High Dimensions
  • Award number:
    2015411
  • Fiscal year:
    2020
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
REU Site: Molecular Biology and Genetics of Cell Signaling
  • Award number:
    1950247
  • Fiscal year:
    2020
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
Domain-Engineering Enabled Thermal Switching in Ferroelectric Materials
  • Award number:
    2011978
  • Fiscal year:
    2020
  • Award amount:
    $400,000
  • Award type:
    Continuing Grant
CAREER: Pushing the Lower Limit of Thermal Conductivity in Layered Materials
  • Award number:
    1943813
  • Fiscal year:
    2020
  • Award amount:
    $400,000
  • Award type:
    Continuing Grant
Travel Support for Student Participation at the 2019 ASME-IMECE Micro and Nano Technology Forum; Salt Lake City, Utah; November 10-14, 2019
  • Award number:
    2000224
  • Fiscal year:
    2019
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
Collaborative Research: Novel Statistical Tools for Metagenomics and Metabolomics Data
  • Award number:
    1903139
  • Fiscal year:
    2019
  • Award amount:
    $400,000
  • Award type:
    Continuing Grant
Collaborative Research: Theoretical and Methodological Frameworks for Causal Inference of Peer Effects
  • Award number:
    1712714
  • Fiscal year:
    2017
  • Award amount:
    $400,000
  • Award type:
    Standard Grant
Variable Selection via Inverse Modeling for Detecting Nonlinear Relationships
  • Award number:
    1613035
  • Fiscal year:
    2016
  • Award amount:
    $400,000
  • Award type:
    Continuing Grant

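Similar NSFC (National Natural Science Foundation of China) grants

Statistical Inference for Spread Back-Tracing Problems Based on Branching Processes
  • Award number:
    12305040
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Statistical Characteristics of the Thermophysical Properties of Potentially Hazardous Asteroids
  • Award number:
    12303066
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Statistical Inference for Latent Error Sequences in High-Dimensional Factor Models
  • Award number:
    12301330
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Statistical Theory and Newton-Type Algorithms for Graph-Based High-Dimensional Linear Regression
  • Award number:
    12301420
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund
Unsupervised Statistical Models for Data Heterogeneity in Decentralized Distributed Computing
  • Award number:
    12301336
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund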

Similar overseas grants

Executive functions in urban Hispanic/Latino youth: exposure to mixture of arsenic and pesticides during childhood
  • Award number:
    10751106
  • Fiscal year:
    2024
  • Award amount:
    $400,000
  • Award type:
Time series clustering to identify and translate time-varying multipollutant exposures for health studies
  • Award number:
    10749341
  • Fiscal year:
    2024
  • Award amount:
    $400,000
  • Award type:
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
  • Award number:
    10462257
  • Fiscal year:
    2023
  • Award amount:
    $400,000
  • Award type:
Data Science and Statistics Core
  • Award number:
    10549489
  • Fiscal year:
    2023
  • Award amount:
    $400,000
  • Award type:
Comprehensive and non-invasive prenatal screening of coding variation
  • Award number:
    10678005
  • Fiscal year:
    2023
  • Award amount:
    $400,000
  • Award type: