Word embedding is a key component of many downstream applications in natural language processing. Existing approaches often assume the availability of a large collection of text for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model from a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse, as the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often sample only a few unobserved word pairs as negative samples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach on four different languages.
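To make the core idea concrete, the sketch below shows a PU-style weighted matrix factorization in NumPy: observed co-occurrence entries are treated as positives with full weight, while zero entries are kept as down-weighted unlabeled examples instead of being discarded or sparsely negative-sampled. This is a minimal illustration, not the paper's implementation; the function name `pu_factorize`, the weights `w_pos`/`w_unl`, and the plain gradient-descent solver are all assumptions for exposition.

```python
import numpy as np

def pu_factorize(M, rank=8, w_pos=1.0, w_unl=0.05, lam=0.1,
                 lr=0.01, iters=200, seed=0):
    """Factorize a sparse co-occurrence matrix M ~ W @ C.T.

    PU-style weighting (illustrative): observed (positive) entries get
    full weight w_pos; zero (unlabeled) entries are not ignored but are
    down-weighted by w_unl, so they still pull reconstructions toward
    zero and contribute information.
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = 0.1 * rng.standard_normal((n, rank))   # word vectors
    C = 0.1 * rng.standard_normal((m, rank))   # context vectors
    # Per-entry confidence weights: positives vs. unlabeled zeros.
    weight = np.where(M > 0, w_pos, w_unl)
    for _ in range(iters):
        # Gradient of the weighted squared loss plus L2 regularization.
        R = weight * (W @ C.T - M)
        W -= lr * (R @ C + lam * W)
        C -= lr * (R.T @ W + lam * C)
    return W, C

# Toy usage: a tiny co-occurrence matrix where most pairs are unobserved.
M = np.zeros((6, 6))
M[0, 1] = M[1, 0] = 3.0
M[2, 3] = M[3, 2] = 2.0
W, C = pu_factorize(M, rank=2)
print(np.round(W @ C.T, 2))
```

The key design choice this sketch isolates is the weight matrix: setting `w_unl = 0` recovers factorization over observed pairs only, while a small positive `w_unl` lets every zero entry act as a weak, unlabeled constraint, which is the intuition the abstract argues for.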