An Overview of Microsoft Web N-gram Corpus and Applications.

HLT-DEMO '10: Proceedings of the NAACL HLT 2010 Demonstration Session (2010)

Abstract
This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distributions of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service, so that it can be updated as deemed necessary by the user community to include new words and phrases that are constantly being added to the Web. Second, the corpus makes available various sections of a Web document, specifically the body, title, and anchor text, as separate models, because the text contents of these sections are found to possess significantly different statistical properties and are therefore treated as distinct languages from the language modeling point of view. The use of the corpus is demonstrated here in two NLP tasks: phrase segmentation and word breaking.
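As an illustration of the word breaking task named in the abstract, the following is a minimal sketch in Python. It assumes a toy unigram model with made-up counts rather than the Web N-gram service; the names TOY_COUNTS, log_prob, and word_break are hypothetical, and the dynamic-programming search shown here is a generic approach, not necessarily the authors' method.

```python
import math

# Hypothetical toy unigram counts (an assumption for illustration only).
# A real system would query an n-gram language model instead.
TOY_COUNTS = {"new": 50, "york": 30, "times": 40, "time": 45, "s": 5,
              "newy": 1, "or": 20, "k": 2}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 10  # longest candidate word considered by the search


def log_prob(word):
    """Log probability of a word under the toy unigram model.

    Unseen words receive a penalty that grows with their length, so the
    search prefers known words over long unknown chunks.
    """
    count = TOY_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / TOTAL) - 2.0 * len(word)
    return math.log(count / TOTAL)


def word_break(text):
    """Return the highest-scoring segmentation of text (dynamic programming)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            score = best[j] + log_prob(text[j:i])
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Follow the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(word_break("newyorktimes"))  # ['new', 'york', 'times']
```

Swapping log_prob for scores from a title, body, or anchor-text model would let the same search reflect the different statistics of each document section.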
Keywords
Microsoft Web N-gram corpus, N-gram corpus, previous corpus release, Web document, XML Web Service, anchor text, available various section, text content, NLP task, different statistical property