Novel statistical models for text mining with applications to Chinese history and texts

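Basic information

  • Award number:
    1208771
  • Principal investigator:
  • Amount:
    $400,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2012
  • Funding country:
    United States
  • Period:
    2012-07-01 to 2016-06-30
  • Status:
    Completed

Project abstract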

In this project, the investigators study a series of challenging problems in extracting information from Chinese text: (1) word/phrase discovery, (2) text segmentation, (3) technical term recognition, and (4) association discovery among technical terms. Unlike alphabetic languages such as English, Chinese has many special properties: no word boundaries, no clear definition of a word, traditionally no punctuation, and a unique grammar. It is therefore problematic to apply most methods developed for alphabetic languages directly to Chinese, and the methods available in the literature for analyzing Chinese text have many limitations. Instead, the investigators propose an advanced word dictionary model (AWDM) that simultaneously achieves word discovery, text segmentation, and technical term recognition, which are traditionally studied separately. The idea is first to build a word dictionary by enumerating all word candidates in the texts that satisfy a certain criterion, and to assign to each candidate a latent word-type label representing a type of technical term (such as names, addresses, office titles, and time labels, as well as background text) together with corresponding word usage frequencies. A Markov dependence model among words and word types then captures the potential grammatical and semantic structure of the texts. With the help of training data (i.e., lists of known technical terms), the AWDM can automatically select the most meaningful words from the huge space of word candidates, determine the type of each word from both its content and its surrounding context, and segment the texts using both grammatical and semantic information.
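The dictionary-building and segmentation steps described above can be illustrated with a much simpler model. The sketch below is not the AWDM: it assumes a plain unigram word dictionary, uses raw substring counts in place of estimated word usage frequencies, and omits the latent word-type labels, the Markov dependence, and the training lists entirely. The function names `enumerate_candidates` and `segment`, and the parameters `max_len` and `min_count`, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def enumerate_candidates(texts, max_len=4, min_count=2):
    """Count every substring of up to max_len characters and keep the frequent ones.

    The frequency threshold stands in for the unspecified "certain criterion".
    """
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    # Single characters are always kept so that every seen text can be segmented.
    return {w: c for w, c in counts.items() if len(w) == 1 or c >= min_count}


def segment(text, counts, max_len=4):
    """Return the most probable segmentation under a unigram dictionary (Viterbi)."""
    total = sum(counts.values())
    best = [0.0] + [-math.inf] * len(text)  # best[i]: best log-probability of text[:i]
    back = [0] * (len(text) + 1)            # back[i]: start index of the last word
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            count = counts.get(text[j:i], 0)
            if count == 0 and i - j == 1:
                count = 0.5  # assumed pseudo-count so unseen characters can be emitted
            if count == 0:
                continue
            score = best[j] + math.log(count / total)
            if score > best[i]:
                best[i], back[i] = score, j
    # Walk the back-pointers from the end of the text to recover the words.
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

Because each additional word multiplies in another probability below one, this toy model prefers fewer, longer dictionary words whenever they are frequent enough. The project's model would instead choose words and boundaries jointly with the word types and their context.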
Compared with existing methods in the literature, the AWDM achieves better efficiency through the joint modeling of grammatical and semantic information and the integrated analysis of word discovery, text segmentation, and technical term recognition. Combined with other text mining tools, such as topic models and theme dictionary models, the proposed method will lead to a powerful multi-level (Chinese character level, word/phrase level, theme level, topic level) analysis platform for Chinese texts.
With the explosive growth of the internet and digital technologies, large quantities of digitized Chinese text can be easily collected. For example, many Chinese historical documents written in traditional Chinese are now available in digital form, and public media such as newspapers, forums, blogs, and microblogs produce huge amounts of Chinese text every day. There is thus great appeal in developing text mining tools that automatically extract information from these data and create new knowledge. The ideas and approaches in this project may have a significant impact on how Chinese history is studied. An efficient and reliable method for extracting information from the ever-growing databases of digitized historical documents will enable researchers to analyze change over time based on large numbers of disaggregated data points, something impractical in the past. Furthermore, although originally designed for Chinese, these approaches have the potential to be applied to other Asian languages similar to Chinese, such as Japanese and Korean, and thus to provide a powerful multi-language platform for the study of Asian history. In addition, the novel way of addressing the challenges of named entity recognition studied in this project has the potential to be extended to alphabetic languages such as English.
Finally, the ideas and approaches studied in this project have the potential to be generalized into a systematic tool that digests any data flow of Chinese text and outputs a structured database containing key information about the individuals and organizations described in the input data, making it easier for researchers to discover social networks among all kinds of "units" in social life. The item association patterns discovered by the project's algorithms are also valuable to the study of public media and sociology, and may help reveal important new epidemiological events and societal trends in a timely fashion. Information of this kind can have important implications for business decision making and governmental policy making.

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Jun Liu

Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
  • DOI:
  • Year published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jinhao Li; Shiyao Li; Jiaming Xu; Shan Huang; Yaoxiu Lian; Jun Liu; Yu Wang; Guohao Dai
  • Corresponding author:
    Guohao Dai
Atrial fibrillation: rhythm control offers no advantage over rate control for some, but not all.
  • DOI:
  • Year published:
    2007
  • Journal:
  • Impact factor:
    4.7
  • Authors:
    Yan Bo Li; C. Hu; Jun Liu; Yuan Xiu Chen; Zhe Qu; Jia Xu; Jiang; Jun Wan; Qi; Congxin Huang
  • Corresponding author:
    Congxin Huang
Independent Relationship of Lipoprotein(a) and Carotid Atherosclerosis With Long-Term Risk of Cardiovascular Disease.
  • DOI:
    10.1161/jaha.123.033488
  • Year published:
    2024
  • Journal:
  • Impact factor:
    5.4
  • Authors:
    Y. Qi; Youling Duan; Q. Deng; Na Yang; Jia; Jiangtao Li; Piaopiao Hu; Jun Liu; Jing Liu
  • Corresponding author:
    Jing Liu
Bridge risk assessment using a hybrid AHP/DEA methodology - art. no. 1493
  • DOI:
    10.2991/iske.2007.266
  • Year published:
    2007
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yun Wang; Jun Liu; Tms Elhag; L. M. López
  • Corresponding author:
    L. M. López
Studies on the Hot Forming and Cold-Die Quenching of AA6082 Tailor Welded Blanks
  • DOI:
  • Year published:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jun Liu; Ailin Wang; Haoxiang Gao; Omer El Fakir; X. Luan; Li Liang Wang; Jianguo Lin
  • Corresponding author:
    Jianguo Lin

Other grants by Jun Liu

REU Site: Molecular Biology and Genetics of Cell Signaling
  • Award number:
    2349577
  • Fiscal year:
    2024
  • Amount:
    $400,000
  • Award type:
    Standard Grant
SCC-PG: Building a smart and connected rural community for improved healthcare access through the deployment of integrated mobility solutions
  • Award number:
    2303284
  • Fiscal year:
    2023
  • Amount:
    $400,000
  • Award type:
    Standard Grant
Collaborative Research: Bayesian and Semi-Bayesian Methods for Detecting Relationships in High Dimensions
  • Award number:
    2015411
  • Fiscal year:
    2020
  • Amount:
    $400,000
  • Award type:
    Standard Grant
Domain-Engineering Enabled Thermal Switching in Ferroelectric Materials
  • Award number:
    2011978
  • Fiscal year:
    2020
  • Amount:
    $400,000
  • Award type:
    Continuing Grant
REU Site: Molecular Biology and Genetics of Cell Signaling
  • Award number:
    1950247
  • Fiscal year:
    2020
  • Amount:
    $400,000
  • Award type:
    Standard Grant
CAREER: Pushing the Lower Limit of Thermal Conductivity in Layered Materials
  • Award number:
    1943813
  • Fiscal year:
    2020
  • Amount:
    $400,000
  • Award type:
    Continuing Grant
Collaborative Research: Novel Statistical Tools for Metagenomics and Metabolomics Data
  • Award number:
    1903139
  • Fiscal year:
    2019
  • Amount:
    $400,000
  • Award type:
    Continuing Grant
Travel Support for Student Participation at the 2019 ASME-IMECE Micro and Nano Technology Forum; Salt Lake City, Utah; November 10-14, 2019
  • Award number:
    2000224
  • Fiscal year:
    2019
  • Amount:
    $400,000
  • Award type:
    Standard Grant
Collaborative Research: Theoretical and Methodological Frameworks for Causal Inference of Peer Effects
  • Award number:
    1712714
  • Fiscal year:
    2017
  • Amount:
    $400,000
  • Award type:
    Standard Grant
Variable Selection via Inverse Modeling for Detecting Nonlinear Relationships
  • Award number:
    1613035
  • Fiscal year:
    2016
  • Amount:
    $400,000
  • Award type:
    Continuing Grant

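Similar NSFC (National Natural Science Foundation of China) grants

Equations of mathematical physics in statistical mechanics
  • Award number:
    12371218
  • Approval year:
    2023
  • Amount:
    435,000 CNY
  • Award type:
    General Program
Statistical inference for optimal individualized treatment regimes under semi-supervision
  • Award number:
    12301337
  • Approval year:
    2023
  • Amount:
    300,000 CNY
  • Award type:
    Young Scientists Fund
Nonequilibrium statistical dynamics of pulsed carbon dioxide electrocatalysis systems
  • Award number:
    22373090
  • Approval year:
    2023
  • Amount:
    500,000 CNY
  • Award type:
    General Program
Statistical analysis methods for the health effects of environmental pollutant mixtures
  • Award number:
    82373690
  • Approval year:
    2023
  • Amount:
    490,000 CNY
  • Award type:
    General Program
Some statistical inference problems for large-scale complex streaming data
  • Award number:
    12371274
  • Approval year:
    2023
  • Amount:
    435,000 CNY
  • Award type:
    General Program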

Similar overseas grants

Executive functions in urban Hispanic/Latino youth: exposure to mixture of arsenic and pesticides during childhood
  • Award number:
    10751106
  • Fiscal year:
    2024
  • Amount:
    $400,000
  • Award type:
Time series clustering to identify and translate time-varying multipollutant exposures for health studies
  • Award number:
    10749341
  • Fiscal year:
    2024
  • Amount:
    $400,000
  • Award type:
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
  • Award number:
    10462257
  • Fiscal year:
    2023
  • Amount:
    $400,000
  • Award type:
Data Science and Statistics Core
  • Award number:
    10549489
  • Fiscal year:
    2023
  • Amount:
    $400,000
  • Award type:
Novel Computational Methods for Microbiome Data Analysis in Longitudinal Study
  • Award number:
    10660234
  • Fiscal year:
    2023
  • Amount:
    $400,000
  • Award type: