Research on Uyghur morphological segmentation based on long sequence labeling method.

Ruohao Yan, Huaping Zhang, Wushour Silamu, Askar Hamdulla

International Conference on Signal Processing and Machine Learning (SPML) (2022)

Abstract
With the steady progress of the "One Belt, One Road" national cooperation initiative, intelligent processing of the languages along the route has become increasingly important for communication. Uyghur is a representative agglutinative language: its words are built from stems and affixes, which makes the data sparse. Morphological segmentation separates Uyghur roots from affixes to address this data sparseness. First, this paper studies the characteristics of the Uyghur morphological segmentation task and proposes a long sequence labeling method. Second, a BiLSTM network learns word-formation features, and a CRF model then learns label features. Finally, the paper proposes a new evaluation method. The paper reproduces relevant prior research and conducts experiments on the public THUUyMorph corpus, where the model reaches an F1 score of 98.60%. The experiments show that these results are better than those of current advanced Uyghur morphological segmentation models, and experiments on a downstream task, Uyghur-Chinese translation, demonstrate the method's effectiveness. The scheme can be transferred to other languages along the route, such as Turkish, and offers a new research direction for morphological segmentation.
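To illustrate how segmentation can be framed as sequence labeling, here is a minimal sketch in Python. It is not the paper's method: the abstract does not specify the long sequence labeling scheme or the new evaluation method, so the sketch assumes a generic character-level B/M/E/S tag set and a simple boundary-level F1. The function names and the Latin-script example word are hypothetical, chosen only for illustration.

```python
from typing import List, Set


def morphemes_to_labels(morphemes: List[str]) -> List[str]:
    """Map a segmented word to one tag per character.

    Assumed tag set (not taken from the paper):
    B = first character of a morpheme, M = middle, E = last,
    S = single-character morpheme.
    """
    labels: List[str] = []
    for morpheme in morphemes:
        if len(morpheme) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(morpheme) - 2) + ["E"])
    return labels


def labels_to_morphemes(word: str, labels: List[str]) -> List[str]:
    """Rebuild morphemes from per-character tags; E or S closes a morpheme."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    morphemes: List[str] = []
    current = ""
    for char, tag in zip(word, labels):
        current += char
        if tag in ("E", "S"):
            morphemes.append(current)
            current = ""
    if current:  # keep trailing characters if the tag sequence is malformed
        morphemes.append(current)
    return morphemes


def boundary_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over internal morpheme boundary positions (a common generic metric)."""

    def cuts(morphemes: List[str]) -> Set[int]:
        positions: Set[int] = set()
        offset = 0
        for morpheme in morphemes[:-1]:
            offset += len(morpheme)
            positions.add(offset)
        return positions

    gold_cuts, pred_cuts = cuts(gold), cuts(pred)
    if not gold_cuts and not pred_cuts:
        return 1.0  # both analyses agree that the word is unsegmented
    true_positives = len(gold_cuts & pred_cuts)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_cuts)
    recall = true_positives / len(gold_cuts)
    return 2 * precision * recall / (precision + recall)


# Illustrative word: "kitablar" segmented as stem "kitab" + plural suffix "lar".
example_labels = morphemes_to_labels(["kitab", "lar"])
example_morphemes = labels_to_morphemes("kitablar", example_labels)
```

In a setup like the one the abstract describes, a BiLSTM-CRF tagger would predict the tag sequence for each word, and a decoder such as `labels_to_morphemes` would turn the predicted tags back into stems and affixes.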