An Approach for Similarity Vietnamese Documents Detection from English Documents.

FDSE (CCIS Volume) (2022)

Abstract
Many existing studies measure the similarity between documents written in the same language, such as Vietnamese to Vietnamese or English to English. Recently, however, a different form of copying has emerged: English sources are translated into Vietnamese and then edited into an author's own manuscript. This is considered cross-language plagiarism. This study therefore applies a new approach: translate the English documents into Vietnamese, then compare each translated document with documents that were modified or copied from a translation. The work focuses on three stages: translating English documents into Vietnamese, preprocessing the documents, and determining the similarity between documents. Similarity is determined with two metrics: Cosine similarity based on Term Frequency (TF) and Inverse Document Frequency (IDF), and word order similarity in the text. Combining the two metrics gives a similarity result that is more accurate and convincing. The data were collected on 7 related topics, with 15 documents ranging in length from 2,000 to more than 8,000 words. A system integrating document translation, based on the Google Translate Application Programming Interface (API), with similarity checking was successfully built, and Precision and Recall both exceed 80%.
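The combination of the two metrics described in the abstract can be illustrated with a minimal sketch. Several parts are assumptions, not taken from the paper: whitespace tokenization stands in for real Vietnamese word segmentation, the IDF is smoothed, the word order formula is the common 1 - ||r1 - r2|| / ||r1 + r2|| form over first-occurrence positions, and the weight `alpha` and all function names are hypothetical. The translation step is omitted; both inputs are assumed to be Vietnamese text already.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real Vietnamese text
    # would need a proper word segmenter.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: smoothed IDF so shared terms keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_order_similarity(a, b):
    """1 - ||r1 - r2|| / ||r1 + r2|| over first-occurrence positions."""
    ta, tb = tokenize(a), tokenize(b)
    vocab = sorted(set(ta) | set(tb))
    # 1-based position of the first occurrence, 0 when the word is absent.
    r1 = [ta.index(w) + 1 if w in ta else 0 for w in vocab]
    r2 = [tb.index(w) + 1 if w in tb else 0 for w in vocab]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


def combined_similarity(a, b, alpha=0.5):
    # alpha is a hypothetical weight between the two metrics.
    vec_a, vec_b = tfidf_vectors([a, b])
    return alpha * cosine(vec_a, vec_b) + (1 - alpha) * word_order_similarity(a, b)
```

In this sketch, two texts with the same words in a different order keep a TF-IDF cosine of 1 but lose score on the word order term, which is the kind of edit a bag-of-words metric alone would miss.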
Keywords
Similar document detection, Similarity, Cross plagiarism, Translation, Cosine similarity, Documents similarity