Multi-label Text Classification (MLTC) is the task of assigning documents to one or more topics. Given the large data volumes and varied domains of such tasks, fully supervised learning requires manually annotated datasets, which are costly and time-consuming to produce. In this paper, we propose BERT-Flow-VAE (BFV), a Weakly-Supervised Multi-Label Text Classification (WSMLTC) model that reduces the need for full supervision. This new model (1) produces BERT sentence embeddings and calibrates them with a flow model, (2) generates an initial topic-document matrix by averaging the outputs of a seeded sparse topic model and a textual entailment model, which require only the surface names of topics and 4-6 seed words per topic, and (3) adopts a VAE framework to reconstruct the embeddings under the guidance of the topic-document matrix. Finally, (4) it uses the means produced by the encoder in the VAE architecture as the MLTC predictions. Experimental results on 6 multi-label datasets show that BFV substantially outperforms other baseline WSMLTC models on key metrics and achieves approximately 84% of the performance of a fully supervised model.
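To make steps (3) and (4) concrete, the following PyTorch sketch shows one way a VAE encoder's means could be guided toward an initial topic-document matrix and then read off as multi-label predictions. All class and function names, dimensions, loss weights, and the exact form of the guidance term are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a topic-guided VAE, assuming MSE reconstruction of
# calibrated sentence embeddings and an MSE guidance term on the encoder
# means; the real BFV objective may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicGuidedVAE(nn.Module):
    """VAE that reconstructs sentence embeddings; the encoder means
    double as per-topic scores for multi-label classification."""
    def __init__(self, emb_dim=768, n_topics=10, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topics)      # means -> MLTC predictions
        self.logvar = nn.Linear(hidden, n_topics)
        self.dec = nn.Sequential(nn.Linear(n_topics, hidden), nn.ReLU(),
                                 nn.Linear(hidden, emb_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def bfv_loss(x, x_hat, mu, logvar, topic_doc, beta=1.0, gamma=1.0):
    recon = F.mse_loss(x_hat, x)                     # reconstruct embeddings
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Guidance: pull encoder means toward the initial topic-document matrix.
    guide = F.mse_loss(torch.sigmoid(mu), topic_doc)
    return recon + beta * kl + gamma * guide
```

At inference time, under these assumptions, `torch.sigmoid(mu)` would be thresholded per topic to yield the multi-label predictions.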