Dataset 2: Recommended Language Models

user-5ebe28ba4c775eda72abcdf3(2019)

Abstract
These word models were trained with a sentence start word of <s>, a sentence end word of </s>, and an unknown word <unk>. The word vocabulary was the most frequent 64K words in the forum dataset that were also in a list of 330K known English words. All words are in lowercase. The character models are 12-gram models and were trained using interpolated Witten-Bell smoothing. The character model vocabulary consists of the lowercase letters a-z, apostrophe, <sp> for a space, <s> for sentence start, and </s> for sentence end.
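
The vocabulary rules described above can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names (`build_vocab`, `word_tokens`, `char_tokens`), whitespace tokenization, and the choice to drop characters outside the character vocabulary are all assumptions made for illustration. It also assumes the 64K cutoff is applied after filtering against the known-word list, which is one possible reading of the description.

```python
from collections import Counter

# Special symbols named in the abstract.
SENT_START, SENT_END, UNK, SPACE = "<s>", "</s>", "<unk>", "<sp>"

# Character vocabulary: lowercase a-z plus apostrophe.
CHAR_SYMBOLS = set("abcdefghijklmnopqrstuvwxyz'")


def build_vocab(sentences, known_words, max_size=64_000):
    """Keep the most frequent lowercase words that are also known words.

    Assumption: the size cutoff is applied after the known-word filter.
    """
    counts = Counter(w for s in sentences for w in s.lower().split())
    vocab = []
    for word, _ in counts.most_common():
        if word in known_words:
            vocab.append(word)
            if len(vocab) == max_size:
                break
    return set(vocab)


def word_tokens(sentence, vocab):
    """Lowercase, map out-of-vocabulary words to <unk>, add sentence markers."""
    words = [w if w in vocab else UNK for w in sentence.lower().split()]
    return [SENT_START, *words, SENT_END]


def char_tokens(sentence):
    """Map a sentence to character-model symbols.

    Assumption: characters outside a-z, apostrophe, and space are dropped.
    """
    tokens = [SENT_START]
    for ch in sentence.lower():
        if ch == " ":
            tokens.append(SPACE)
        elif ch in CHAR_SYMBOLS:
            tokens.append(ch)
    tokens.append(SENT_END)
    return tokens


if __name__ == "__main__":
    forum = ["The cat sat", "the cat zzzq ran"]   # toy stand-in for the forum data
    known = {"the", "cat", "sat", "ran"}          # toy stand-in for the 330K word list
    vocab = build_vocab(forum, known, max_size=3)
    print(word_tokens("The cat ran", vocab))
    print(char_tokens("It's ok"))
```

With the toy data, "ran" falls outside the three-word vocabulary and is mapped to <unk>, while "zzzq" is excluded because it is not in the known-word list regardless of frequency.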