Sudachi: a Japanese Tokenizer for Business.

LREC (2018)

Abstract
Tokenization, or morphological analysis, is a fundamental and important technology for processing Japanese text, especially in industrial applications. However, we often face many obstacles, such as inconsistent token units across different resources, notation variations, discontinued maintenance of resources, and various issues with existing tokenizer implementations. To improve this situation, we develop a tokenizer called Sudachi and an accompanying dictionary, with features such as multi-granular output and normalization of notation variations. In addition, we maintain the software and language resources continuously over the long term as part of our company's business. We release the resulting tokenizer software and language resources to the public as open source software. They are available at https://github.com/WorksApplications/Sudachi.
Keywords
Tokenization, Morphological Analysis, Segmentation, Part-of-Speech Tagging, Lemmatization, Open Source Software