Novel statistical models for text mining with applications to Chinese history and texts

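Basic information

Award number:
1208771
Principal investigator:
Jun Liu
Amount:
$400,000
Host institution:
Harvard University
Host institution country:
United States
Project type:
Continuing Grant
Fiscal year:
2012
Funding country:
United States
Start and end dates:
2012-07-01 to 2016-06-30
Project status:
Completed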

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1208771&HistoricalAwards=false
Keywords:
Novel statistical models text mining

Project abstract

In this project, the investigators study a series of challenging problems in extracting information from Chinese text, including: (1) word/phrase discovery, (2) text segmentation, (3) technical term recognition, and (4) association discovery among technical terms. Unlike alphabetic languages such as English, Chinese has many special properties: no word boundaries, no clear definition of words, traditionally no punctuation, and a unique grammar. It is therefore problematic to apply most methods developed for alphabetic languages directly to Chinese. Moreover, the methods available in the literature for analyzing Chinese text have many limitations. Instead, the investigators propose an advanced word dictionary model (AWDM) that can simultaneously achieve word discovery, text segmentation, and technical term recognition, which are traditionally studied separately. The idea is first to build a word dictionary by enumerating all word candidates in the texts that satisfy a certain criterion, and to assign each word candidate a latent word type label (representing different types of technical terms such as names, addresses, office titles, and time labels, as well as background text) and a corresponding word usage frequency. Then, a Markov dependence model among different words and word types is used to model the potential grammatical and semantic structure of the texts. With the help of training data (i.e., lists of known technical terms), the AWDM can automatically select the most meaningful words from the huge space of word candidates, determine the word type of each word based on both the content of the word and the context around it, and segment the texts based on both grammatical and semantic information.
Compared to the existing methods in the literature, the AWDM is more efficient because it models grammatical and semantic information jointly and integrates the analysis of word discovery, text segmentation, and technical term recognition. Combined with other text mining tools, such as topic models and theme dictionary models, the proposed method will lead to a powerful multi-level (Chinese character level, word/phrase level, theme level, topic level) analysis platform for Chinese texts.
With the explosive growth of the internet and digital technologies, large quantities of digitized Chinese texts can be easily collected. For example, many Chinese historical documents written in traditional Chinese are now available in digital form, and public media such as newspapers, forums, blogs, and microblogs are producing huge amounts of Chinese text every day. Thus there is great appeal in developing text mining tools to automatically extract information from these data and create new knowledge. The ideas and approaches in this project may have significant impacts on how Chinese history will be studied. An efficient and reliable method for extracting information from the ever-growing databases of digitized historical documents will enable researchers to analyze change over time based on large numbers of disaggregated data points, something impractical in the past. Furthermore, although originally designed for Chinese, these approaches have the potential to be applied to other Asian languages similar to Chinese, such as Japanese and Korean, and thus provide a powerful multi-language platform for the study of Asian history. In addition, the novel way of addressing the challenges of named entity recognition studied in this project also has the potential to be extended to alphabetic languages such as English.
Finally, the ideas and approaches studied in this project have the potential to be generalized into a systematic tool that digests any data flow of Chinese texts and outputs a structured database containing key information about the individuals and organizations described by the input data, thus making it easier for researchers to discover social networks among all kinds of "units" in our social life. The various item association patterns discovered by our algorithms are also invaluable to the study of public media and sociology, and may help reveal important new epidemiological events and societal trends in a timely fashion. These types of information can have important implications for business decision making and governmental policy making.
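
The dictionary-building and segmentation steps described in the abstract can be illustrated with a minimal sketch. This is not the investigators' AWDM: it leaves out the latent word type labels, the Markov dependence model, and the training lists of known terms, and it uses raw substring counts as a stand-in for estimated word usage frequencies. The function names, the `max_len` and `min_count` parameters, and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def enumerate_candidates(texts, max_len=4, min_count=2):
    """Count every substring of up to max_len characters and keep frequent ones.

    Single characters are always kept so that any seen text can be segmented.
    The frequency threshold is a hypothetical stand-in for the abstract's
    unspecified "certain criterion".
    """
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                counts[text[i:j]] += 1
    return {w: c for w, c in counts.items() if len(w) == 1 or c >= min_count}


def segment(text, dictionary, max_len=4):
    """Return the highest-probability segmentation under a unigram model."""
    total = sum(dictionary.values())
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = dictionary.get(word, 0)
            if count == 0 and len(word) == 1:
                count = 1  # floor count for an unseen single character
            if count == 0:
                continue
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical sample sentences, not taken from the project's data.
texts = ["王安石上書", "王安石變法", "司馬光上書"]
dictionary = enumerate_candidates(texts)
print(segment("王安石上書", dictionary))
```

Because every word contributes a probability below one, this toy model prefers segmentations with fewer, more frequent words; the AWDM as described would additionally use word types and context to make that choice.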

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Jun Liu

Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization

DOI:
Publication year:
2023
Journal:
arXiv.org
Impact factor:
0
Authors:
Jinhao Li;Shiyao Li;Jiaming Xu;Shan Huang;Yaoxiu Lian;Jun Liu;Yu Wang;Guohao Dai
Corresponding author:
Guohao Dai

Atrial fibrillation: rhythm control offers no advantage over rate control for some, but not all.

DOI:
Publication year:
2007
Journal:
Medical Hypotheses
Impact factor:
4.7
Authors:
Yan Bo Li;C. Hu;Jun Liu;Yuan Xiu Chen;Zhe Qu;Jia Xu;Jiang;Jun Wan;Qi;Congxin Huang
Corresponding author:
Congxin Huang

Independent Relationship of Lipoprotein(a) and Carotid Atherosclerosis With Long-Term Risk of Cardiovascular Disease.

DOI:
10.1161/jaha.123.033488
Publication year:
2024
Journal:
Journal of the American Heart Association
Impact factor:
5.4
Authors:
Y. Qi;Youling Duan;Q. Deng;Na Yang;Jia;Jiangtao Li;Piaopiao Hu;Jun Liu;Jing Liu
Corresponding author:
Jing Liu