Research on Sample Importance Models and Their Applications in Automatic Text Categorization

Final Report

Project Introduction


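Basic Information

Approval number: 61272212
Project category: General Program
Funding amount: 700,000 yuan (70.0万)
Principal investigator: 王明文
Host institution: Jiangxi Normal University (江西师范大学)
Discipline classification: F0211. Information Retrieval and Social Computing
Completion year: 2016
Approval year: 2012
Project status: Completed
Duration: 2013-01-01 to 2016-12-31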

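Project participants: 罗远胜; 左家莉; 揭安全; 吴根秀; 王晓庆; 汤皖宁; 马俏; 廖亚男; 胡海亮
Keywords: dual relationship; class boundary; automatic text categorization; feature selection; sample importance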

Project Abstract

Automatic text categorization is important for analyzing and organizing Internet data effectively. Its main challenges are the massive scale and high dimensionality of the data. A direct and effective approach is to reduce computational complexity through sample reduction or dimensionality reduction, which can improve the classifier's generalization ability without loss of classification performance. Most sample selection methods are based on statistical sampling theory, which requires the samples to be independent and identically distributed (i.i.d.). Boosting and large-margin approaches implicitly embody the idea of sample selection, but they depend on specific algorithms.

Inspired by the theory of worked examples in cognitive science, this project proposes a sample importance principle. Sample importance is measured by the contribution of each sample to classification, without any statistical assumption. To derive a sample importance model that does not depend on specific classifiers, we will use random process theory and high-dimensional data analysis to identify class boundaries in the training set automatically, design algorithms for computing sample importance, and give mathematical proofs. For example, a random walk algorithm can be used to find the boundary set and to compute the boundariness of every sample.

Furthermore, sample importance will be combined with existing machine learning methods to improve their performance. We will present novel methods for selecting features and samples by building a dual relationship between sample importance and feature importance. The work will provide new ideas and methods for text categorization and for general classification in machine learning.

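Automatic text classification plays an important role in effectively analyzing and exploiting Internet data, but the massive volume and high dimensionality of these data are the main difficulties it faces. A direct and effective solution is to reduce computational complexity through sample-set reduction or dimensionality reduction while preserving the classification performance of the learning algorithm, and to improve the classifier's generalization ability. Existing sample selection methods are mostly based on statistical sampling techniques and require the i.i.d. assumption; Boosting and maximum-margin methods implicitly contain the idea of sample selection but depend on specific classification algorithms. Inspired by worked-example theory in cognitive science, this project makes no statistical assumption about the distribution of the training samples. Starting from the samples themselves, it proposes a sample importance principle based on how much each sample contributes to classification. It intends to apply random process theory and high-dimensional statistical data analysis to give a method for automatically identifying class-boundary samples in the training set, to build a sample importance model that does not depend on any specific classifier, to study algorithms for computing sample importance, and to give theoretical proofs. Combined with existing classification algorithms, it will study classification algorithms that incorporate sample weights. It will also construct a dual relationship between sample importance and feature importance and study corresponding new methods for feature selection and sample selection, providing new ideas for text classification and for general classification problems.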
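The abstract mentions using random walks to score how close each sample lies to a class boundary, but it does not specify the algorithm. The following is a minimal sketch of one possible reading, not the project's actual method. It assumes a k-nearest-neighbour graph with exponentially decaying edge weights, and it defines a sample's boundariness as the probability that a short random walk started at that sample ends on a sample of a different class. The function names `knn_transition_matrix` and `boundariness`, and the parameters `k` and `steps`, are hypothetical.

```python
import math


def knn_transition_matrix(points, k=3):
    """Row-stochastic transition matrix of a random walk on a k-NN graph.

    Assumption: edge weight exp(-distance) to each of the k nearest samples.
    """
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other samples by Euclidean distance to sample i.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        weights = [(math.exp(-d), j) for d, j in dists[:k]]
        total = sum(w for w, _ in weights)
        for w, j in weights:
            P[i][j] = w / total
    return P


def boundariness(points, labels, k=3, steps=2):
    """Probability that a `steps`-step walk from each sample ends in another class."""
    n = len(points)
    P = knn_transition_matrix(points, k)
    scores = []
    for i in range(n):
        # Start with all probability mass on sample i.
        dist = [0.0] * n
        dist[i] = 1.0
        for _ in range(steps):
            dist = [sum(dist[a] * P[a][b] for a in range(n)) for b in range(n)]
        # Mass that ended on samples whose label differs from sample i's label.
        scores.append(sum(p for p, y in zip(dist, labels) if y != labels[i]))
    return scores


if __name__ == "__main__":
    # Toy data: two classes laid out along a line, meeting between 3 and 4.
    pts = [(float(x),) for x in range(8)]
    lbl = ["A"] * 4 + ["B"] * 4
    for p, s in zip(pts, boundariness(pts, lbl, k=2, steps=2)):
        print(p, round(s, 3))
```

Under these assumptions, samples deep inside a class receive a score near zero, while samples adjacent to the other class receive a higher score. Such a score makes no assumption about the sample distribution and does not refer to any particular classifier.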

Final Report Abstract

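The massive volume and high dimensionality of Web data are the main difficulties facing automatic classification. A direct and effective solution is to reduce computational complexity through sample-set reduction or dimensionality reduction while preserving the classification performance of the learning algorithm, and to improve the classifier's generalization ability. Inspired by worked-example theory in cognitive science, this project starts from the samples themselves and proposes a sample importance principle based on how much each sample contributes to a classification or retrieval task, and applies it to text classification and information retrieval models.

Project members published 28 related papers and successfully hosted the 4th International Conference on Natural Language Processing and Chinese Computing (NLP&CC 2015), the 5th National Conference on Social Media Processing (SMP2016), and the 59th session of the China Computer Federation Advanced Disciplines Lectures (CCF ADL 59). The project trained 11 master's students and has 2 doctoral students currently enrolled. Well-known scholars, including Professor Jiawei Han (韩家炜) of the University of Illinois at Urbana-Champaign and Professor Changning Huang (黄昌宁) of Tsinghua University, were invited to lecture and exchange ideas on campus.

The main research work is as follows:

1. Sample importance model. Based on random process theory and high-dimensional statistical data analysis, random walks are used to compute a boundary value for each sample point, from which a sample importance score is derived, giving a sample importance model that does not depend on any specific classifier. Based on graph theory, a Markov network is constructed to measure sample importance, and relations such as cliques and hierarchical dependencies are then used, according to the characteristics of the samples, to analyze the associations among them.

2. Applications of the sample importance model. The model is applied to text classification, where a new classification model that combines it with the KNN method, SI-KNN, is proposed and algorithms for computing sample importance are studied. In information retrieval, Markov networks are used to characterize relationships among documents, and the sample importance model is studied by measuring the relevance of document samples to the query through document cliques, splitting document samples into sentences, computing the relevance of bilingual topics, and building document relationship graphs from neighbouring documents relevant to the query.

3. Feature selection based on sample importance. Sample importance indicators are used to compute the importance of text features and to perform feature selection. The contribution of a word in the Markov network to documents and queries is used to represent the word's importance, and "important" words are discovered in a hierarchically dependent Markov network. Sparse coding is used to reconstruct features, a semantic network graph among features is built by Markov random walks, and feature importance is then computed. Word-clique information is extracted from the Markov network to quantify the mixed correlations between words.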
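The report names SI-KNN as a model that combines sample importance with KNN, but it does not describe how the two are combined. The sketch below shows one plausible combination, offered purely as an assumption: each of the k nearest training samples votes for its label with a weight equal to its importance score instead of a uniform weight of 1. The function name `importance_weighted_knn` and its parameters are hypothetical, and the importance scores are taken as given inputs.

```python
import math
from collections import defaultdict


def importance_weighted_knn(train_points, train_labels, importance, query, k=3):
    """Predict a label for `query` by importance-weighted voting among k neighbours.

    Assumption: `importance[i]` is a non-negative score for training sample i,
    computed beforehand by some sample importance model.
    """
    # Indices of training samples ordered by Euclidean distance to the query.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = defaultdict(float)
    for i in order[:k]:
        # Each neighbour contributes its importance score to its own label.
        votes[train_labels[i]] += importance[i]
    return max(votes, key=votes.get)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (3.0,)]
    lbl = ["A", "A", "B", "B"]
    query = (1.4,)
    print(importance_weighted_knn(pts, lbl, [1.0, 1.0, 1.0, 1.0], query))
    print(importance_weighted_knn(pts, lbl, [0.1, 0.1, 1.0, 1.0], query))
```

With uniform scores this reduces to ordinary majority-vote KNN. Non-uniform scores let a small number of highly important neighbours outweigh a larger number of unimportant ones, which is the behaviour one would expect from a scheme that fuses sample weights into the classifier.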