Repetition and Language Models and Comparable Corpora.

BUCC '09: Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (2011)

Abstract
I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information Retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squares in dotplots. Parallel corpora have both squares and diagonals multiplexed together. The diagonals tell us what is a translation of what, and the squares tell us what is in the same language. I would expect dotplots of comparable corpora to contain lots of diagonals and squares, though the diagonals would be shorter and more subtle in comparable corpora than in parallel corpora.
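The contrast between ordered and unordered matches can be illustrated with a minimal sketch. It assumes whitespace tokens and exact token equality, and the helper names (`dotplot`, `longest_diagonal`, `cosine_similarity`, `render`) are hypothetical and chosen only for this example. It is not the method used in the talk, just one simple way to see why diagonals depend on order while cosine similarity does not.

```python
import math
from collections import Counter


def dotplot(tokens_a, tokens_b):
    """Return a 0/1 matrix with a dot at (i, j) when tokens_a[i] == tokens_b[j]."""
    return [[1 if a == b else 0 for b in tokens_b] for a in tokens_a]


def longest_diagonal(matrix):
    """Length of the longest unbroken diagonal run of dots (an ordered match)."""
    best = 0
    previous = []  # diagonal run lengths ending in the previous row
    for row in matrix:
        current = [0] * len(row)
        for j, dot in enumerate(row):
            if dot:
                above_left = previous[j - 1] if previous and j > 0 else 0
                current[j] = above_left + 1
                best = max(best, current[j])
        previous = current
    return best


def cosine_similarity(tokens_a, tokens_b):
    """Bag-of-words cosine: an unordered match that ignores token positions."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm = math.sqrt(sum(c * c for c in counts_a.values())
                     * sum(c * c for c in counts_b.values()))
    return dot / norm if norm else 0.0


def render(matrix):
    """Draw the dotplot as text, '#' for a match and '.' otherwise."""
    return "\n".join("".join("#" if dot else "." for dot in row) for row in matrix)


if __name__ == "__main__":
    # Toy data (an assumption for illustration): same words, different order.
    original = "the cat sat on the mat".split()
    scrambled = "mat the on sat cat the".split()
    print(render(dotplot(original, original)))
    print(longest_diagonal(dotplot(original, original)))
    print(longest_diagonal(dotplot(original, scrambled)))
    print(cosine_similarity(original, scrambled))
```

In this toy example the scrambled sequence contains exactly the same words as the original, so the cosine similarity stays at its maximum, while the longest diagonal drops from 6 to 1. A real dotplot over corpora would typically work on much longer sequences and might match on character n-grams rather than whole tokens; that choice is left open here.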
Keywords
comparable corpus, parallel corpus, Information Retrieval, cosine similarity, interesting DNA sequence, non-standard feature, unordered match, language model