Translation-equivalent pairs of named entities are valuable in cross-language information processing. However, because corpus resources are limited, methods for extracting Chinese-Khmer named entity equivalent pairs have not been studied in depth, either in China or abroad. Starting from comparable corpora, and taking into account the characteristics of different entity types and how they behave in comparable corpora, this paper selects four features: the transliteration and translation features from Khmer named entities to Chinese named entities, the context features of named entities in the comparable corpora, and the length features of the entities themselves. It then proposes a similarity calculation method based on multi-feature fusion to mine Chinese-Khmer bilingual named entity equivalent pairs. Experiments show that the method achieves relatively good results: for person-name entity pairs, precision reaches 76% and recall reaches 66%, demonstrating that the method outperforms approaches that use only a single feature.
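The abstract does not state how the four features are combined, so the following is only a minimal sketch of one common form of multi-feature fusion: a normalized weighted sum of per-feature similarity scores, followed by thresholded selection of the best Chinese candidate for a Khmer entity. The names `FeatureScores`, `fused_similarity`, `length_similarity`, and `best_match`, the weights, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureScores:
    """Per-feature similarity scores in [0, 1] for one Khmer/Chinese candidate pair."""
    transliteration: float
    translation: float
    context: float
    length: float


# Hypothetical weights; the paper's actual weighting scheme is not given in the abstract.
DEFAULT_WEIGHTS = FeatureScores(transliteration=0.4, translation=0.3, context=0.2, length=0.1)


def length_similarity(len_a: int, len_b: int) -> float:
    """Assumed length feature: ratio of the shorter length to the longer one."""
    if len_a <= 0 or len_b <= 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def fused_similarity(scores: FeatureScores, weights: FeatureScores = DEFAULT_WEIGHTS) -> float:
    """Combine the four feature scores into one similarity via a normalized weighted sum."""
    pairs = [
        (scores.transliteration, weights.transliteration),
        (scores.translation, weights.translation),
        (scores.context, weights.context),
        (scores.length, weights.length),
    ]
    total_weight = sum(w for _, w in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in pairs) / total_weight


def best_match(
    candidates: dict[str, FeatureScores],
    threshold: float = 0.5,  # hypothetical acceptance threshold
) -> Optional[tuple[str, float]]:
    """Return the highest-scoring candidate and its score, or None if below the threshold."""
    if not candidates:
        return None
    name, score = max(
        ((cand, fused_similarity(fs)) for cand, fs in candidates.items()),
        key=lambda item: item[1],
    )
    return (name, score) if score >= threshold else None
```

Under these assumptions, a single-feature baseline corresponds to placing all of the weight on one field, which makes it straightforward to compare the fused score against each feature used alone.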