Unicode Normalization and Grapheme Parsing of Indic Languages
arXiv (2023)
Abstract
Writing systems of Indic languages have orthographic syllables, also known as
complex graphemes, as unique horizontal units. A prominent feature of these
languages is that each complex grapheme combines consonants or consonant
conjuncts, vowel diacritics, and consonant diacritics, which together
characterize the written language. Unicode-based writing schemes for these
languages often disregard this feature and encode words as linear sequences
of Unicode characters, relying on an intricate scheme of connector characters
and font interpreters. Because a few dozen Unicode characters are used to write
thousands of distinct glyphs (complex graphemes), there are serious
ambiguities that lead to malformed words. In this paper, we propose two
libraries: i) a normalizer that resolves inconsistencies caused by the
Unicode-based encoding scheme for Indic languages, and ii) a grapheme parser
for Abugida text. The parser deconstructs words into visually distinct
orthographic syllables, or complex graphemes, and their constituents. Our
proposed normalizer is a more efficient and effective tool than the previously
used IndicNLP normalizer. Moreover, our parser and normalizer are also suitable
tools for general Abugida text processing, as they performed well in our robust
word-based and NLP experiments. We report the pipeline for the scripts of 7
languages in this work and develop the framework for the integration of more
scripts.
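
To make the two ideas in the abstract concrete, here is a minimal Python sketch using only the standard library. It is not the paper's normalizer or parser. Standard Unicode NFC normalization stands in for the normalization step, and `split_graphemes` is a hypothetical helper that groups a Bengali word into orthographic syllables under one simplifying assumption: a character joins the previous cluster if it is a combining mark or if it follows a hasanta (virama).

```python
import unicodedata

HASANTA = "\u09cd"  # Bengali virama, the connector that forms consonant conjuncts


def split_graphemes(word):
    """Group characters into orthographic syllables (illustrative sketch only).

    Assumption: a character attaches to the previous cluster when it is a
    combining mark (Unicode category M*) or when the previous cluster ends
    with a hasanta. Real parsers handle many more cases.
    """
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        after_hasanta = bool(clusters) and clusters[-1].endswith(HASANTA)
        if clusters and (is_mark or after_hasanta):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Encoding ambiguity: the Bengali vowel sign O can be stored as one code
# point or as two, and both render identically.
composed = "\u09cb"         # VOWEL SIGN O
two_part = "\u09c7\u09be"   # VOWEL SIGN E + VOWEL SIGN AA
same_before = composed == two_part
same_after = unicodedata.normalize("NFC", two_part) == composed

# Six code points, two visual units.
word = "\u09b6\u09bf\u0995\u09cd\u09b7\u09be"
clusters = split_graphemes(word)
print(same_before, same_after, len(word), clusters)
```

The word in the example is stored as six code points but is read as two complex graphemes, the second of which contains a consonant conjunct joined by the hasanta. This gap between storage units and visual units is the one a grapheme parser is meant to bridge.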