As big data moves toward big knowledge, data management and machine learning techniques are increasingly combined to address challenging problems. In this paper, we address a problem in natural language processing that involves learning by mining large text databases. Specifically, we consider preposition prediction, with a focus on ESL (English as a second language) learners. Prepositions are function words that typically express a relationship between a noun or pronoun and other elements of a sentence, and they play a key role in determining its meaning. Predicting the correct preposition is challenging because preposition usage is one of the most subtle aspects of English grammar, which makes it difficult for non-native speakers. We propose an approach to preposition prediction called WordPrep and build a tool based on it. WordPrep mines the words themselves rather than their lexical or syntactic connotations. This addresses the challenge of prepositions that appear in idiomatic phrases or in different semantic contexts, where the actual words are more informative than their grammatical positions. Our solution is a direct data-driven approach: it predicts the missing preposition in a sentence by learning from matching tokens, namely n-grams made up of the words before and after the preposition. The approach applies various search and pattern-matching methods to a large number of database records drawn from big text corpora to predict the missing preposition(s). We describe our pilot approach, tool implementation, and experiments. This work is particularly useful for pedagogical applications.
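The idea of predicting a missing preposition from the words on either side of it can be illustrated with a minimal sketch. This is not the WordPrep implementation: the preposition list, the function names `build_index` and `predict`, the window size, the back-off from wider to narrower contexts, and the toy corpus are all assumptions made here for illustration, and an in-memory dictionary stands in for the database records.

```python
from collections import Counter, defaultdict

# Small illustrative preposition list (an assumption, not taken from the paper).
PREPOSITIONS = {"in", "on", "at", "of", "for", "with", "to", "by", "from", "about"}


def build_index(sentences, window=2):
    """Map (left words, right words) contexts to counts of the preposition between them."""
    index = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok not in PREPOSITIONS:
                continue
            # Store every context width from 1 up to `window` words per side,
            # so that lookup can back off to shorter n-grams.
            for k in range(1, window + 1):
                left = tuple(tokens[max(0, i - k):i])
                right = tuple(tokens[i + 1:i + 1 + k])
                index[(left, right)][tok] += 1
    return index


def predict(index, left_words, right_words, window=2):
    """Return the most frequent preposition for the widest matching context, or None."""
    left_words = [w.lower() for w in left_words]
    right_words = [w.lower() for w in right_words]
    for k in range(window, 0, -1):
        key = (tuple(left_words[-k:]), tuple(right_words[:k]))
        if key in index:
            return index[key].most_common(1)[0][0]
    return None


# Toy corpus standing in for a large text database (hypothetical data).
corpus = [
    "She is interested in classical music",
    "He is interested in modern art",
    "They arrived at the airport early",
    "We depend on public transport",
    "I am good at playing chess",
]
index = build_index(corpus)

# Query: "She was interested ___ classical dance"
print(predict(index, ["was", "interested"], ["classical", "dance"]))
```

In this sketch the two-word context is tried first and the lookup falls back to one word on each side when no wider match exists. A real system would query a large corpus and would need a policy for ties and for contexts that never occur.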