
A Hybrid Part-of-speech Tagger with Annotated Kurdish Corpus: Advancements in POS Tagging

Dastan Maulud, Karwan Jacksi, Ismael Ali

DIGITAL SCHOLARSHIP IN THE HUMANITIES (2023)

Abstract
With the rapid growth of online content written in the Kurdish language, there is an increasing need to make it machine-readable and processable. Part-of-speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect. The corpus contains 74,258 words and thirty-eight tags. The tagger takes a hybrid approach, combining a bigram hidden Markov model with a Kurdish rule-based approach to POS tagging. This combination addresses two key problems that arise with purely rule-based approaches, namely misclassified words and words left unanalyzed because of ambiguity. The accuracy of the proposed approach was assessed by training and testing it on the DASTAN corpus, yielding a 96% accuracy rate. Overall, the findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish.
Keywords
machine readability, natural language processing, part of speech tagging, text corpus, bigram HMM, rule-based approach, speech recognition
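
The abstract does not spell out how the bigram HMM and the rule-based component are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes that rules narrow the set of candidate tags for a word and that a bigram HMM decoded with Viterbi picks among the remaining candidates, which also covers words the rules leave unanalyzed. The toy English corpus, the suffix rules, and all function names are hypothetical and chosen purely for illustration; they are not drawn from the DASTAN corpus or its tag set.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # pseudo-tag marking the start of a sentence


def train_bigram_hmm(tagged_sentences):
    """Count tag-bigram transitions and word emissions from tagged sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    return trans, emit, tags, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def rule_candidates(word, rules, tags):
    """Hypothetical rule layer: the first matching suffix rule restricts the
    candidate tags; words no rule covers keep the full tag set."""
    for suffix, allowed in rules:
        if word.endswith(suffix):
            return [t for t in allowed if t in tags] or tags
    return tags


def viterbi(words, model, rules=()):
    """Bigram-HMM Viterbi decoding over the rule-filtered candidate tags."""
    trans, emit, tags, vocab = model
    n_words = len(vocab) + 1  # one extra slot for unknown words
    best = {START: (0.0, [])}  # tag -> (log score, best path ending in tag)
    for word in words:
        new_best = {}
        for tag in rule_candidates(word, rules, tags):
            e = log_prob(emit[tag], word, n_words)
            score, path = max(
                ((s + log_prob(trans[prev], tag, len(tags)) + e, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy English data, for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
RULES = [("s", ["VERB", "NOUN"])]  # hypothetical suffix rule

model = train_bigram_hmm(corpus)
print(viterbi(["the", "cat", "barks"], model, RULES))
```

In this arrangement the rules act as a hard filter and the HMM resolves whatever ambiguity remains; a real system could equally apply the rules first and fall back to the HMM only for unresolved tokens.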