Constructing A Turkish Corpus For Paraphrase Identification And Semantic Similarity

COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2016), PT I (2018)

Abstract
The paraphrase identification (PI) task has practical importance for work in Natural Language Processing (NLP) because of the problem of linguistic variation. Accurate PI methods should help improve the performance of key NLP applications. Paraphrase corpora are important resources for developing and evaluating PI methods. This paper describes the construction of a paraphrase corpus for Turkish. The corpus comprises pairs of sentences with semantic similarity scores based on human judgments, permitting experimentation with both PI and semantic similarity. We believe this is the first such corpus for Turkish. The data collection and scoring methodology is described, and initial PI experiments with the corpus are reported. Our approach to PI is novel in using 'knowledge-lean' methods (i.e., methods that make no use of manually constructed knowledge bases or of processing tools that rely on them). We have previously achieved excellent results using such techniques on the Microsoft Research Paraphrase Corpus, and close to state-of-the-art performance on the Twitter Paraphrase Corpus.
Keywords
Paraphrase identification, Turkish, Corpora construction, Knowledge-lean, Paraphrasing, Sentential semantic similarity