Improved Cross-Lingual Document Similarity Measurement

Udhan Isuranga, Janaka Sandaruwan, Udesh Athukorala, Gihan Dias

2020 International Conference on Asian Language Processing (IALP) (2020)

Abstract
We present an efficient and effective system for identifying documents in a target language that are similar to a given document in a source language. In this work, the source and target documents are in Sinhala and English, but the system can be extended to any other languages for which suitable embeddings exist. It improves on both the accuracy and the speed of the current state of the art. We compiled a corpus of candidate target documents in each of the two languages of interest. For a source document, we compute the distance between it and each document in the corpus using their sentence embeddings. Nearest neighbor retrieval speeds up matching by restricting the set of target documents searched for a given source document. A scoring function and a matching algorithm then pair the identified sentences, and number matching and named entity matching further improve accuracy.
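The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the embedding model, the nearest neighbor index, or the scoring function, so everything below is an assumption made for illustration. The sketch averages sentence embeddings into a single document vector (a simplification), uses brute-force cosine distance in place of an approximate nearest neighbor index, and re-scores the shortlisted candidates with a hypothetical number-overlap bonus. The names `rank_targets`, `number_overlap`, `k`, and `number_weight` are all hypothetical, and the embeddings are toy two-dimensional vectors.

```python
import math
import re


def mean_vector(sentence_embeddings):
    """Average a list of sentence embeddings into one document vector."""
    dim = len(sentence_embeddings[0])
    n = len(sentence_embeddings)
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def number_overlap(text_a, text_b):
    """Jaccard overlap of the digit strings found in the two texts."""
    nums_a = set(re.findall(r"\d+", text_a))
    nums_b = set(re.findall(r"\d+", text_b))
    if not nums_a and not nums_b:
        return 0.0
    return len(nums_a & nums_b) / len(nums_a | nums_b)


def rank_targets(source, targets, k=2, number_weight=0.1):
    """Shortlist the k nearest targets, then re-rank them best-first.

    Assumption: a lower score is better, and shared numbers reduce the score.
    """
    src_vec = mean_vector(source["embeddings"])

    def distance(target):
        return cosine_distance(src_vec, mean_vector(target["embeddings"]))

    # Step 1: restrict the search to the k nearest candidates.
    # A real system would query an approximate nearest neighbor index here.
    candidates = sorted(targets, key=distance)[:k]

    # Step 2: re-score the shortlist with the number-matching bonus.
    def score(target):
        bonus = number_weight * number_overlap(source["text"], target["text"])
        return distance(target) - bonus

    return [t["id"] for t in sorted(candidates, key=score)]


# Toy data: 2-dimensional vectors stand in for bilingual sentence embeddings.
source = {"text": "Budget 2020: 45 million", "embeddings": [[1.0, 0.0], [0.8, 0.2]]}
targets = [
    {"id": "a", "text": "The 2020 budget is 45 million", "embeddings": [[0.85, 0.15]]},
    {"id": "b", "text": "Weather report", "embeddings": [[0.0, 1.0]]},
    {"id": "c", "text": "A budget overview", "embeddings": [[0.9, 0.1]]},
]
ranked = rank_targets(source, targets, k=2)
print(ranked)
```

In this toy data, target `c` is the closest by embedding distance alone, but target `a` shares both numbers with the source, so the number-matching bonus moves it ahead. Named entity matching could be added as a second bonus term of the same shape.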
Keywords
Approximate Nearest Neighbor (ANN), Bilingual Embeddings, Cross-Lingual Similarity, Document Alignment, Low-Resource Languages, Machine Translation (MT), Natural Language Processing (NLP)