On the Entropy of Written Afan Oromo

e-Infrastructure and e-Services for Developing Countries (2022)

Abstract
Afan Oromo is the language of the Oromo people, the largest ethnolinguistic group in Ethiopia. Written Afan Oromo uses the Latin alphabet. In electronic communication systems, letters of the alphabet are represented with the standard ASCII-8 code, which uses 8 bits/letter, or with UTF-16 fixed-length encoding, which uses 16 bits/letter. Moreover, the language uses gemination (i.e., doubling of a consonant), and long vowels are represented by double letters, e.g., “dammee” (sweet potato). From an information-theoretic perspective, this doubling and these fixed-length encoding schemes add redundancy to written Afan Oromo. This redundancy, in turn, contributes to inefficient use of communication resources, such as bandwidth and energy, during transmission and storage of texts written in Afan Oromo. This paper applies information theory to estimate the entropy of written Afan Oromo. We use a higher-order Markov chain, also called an N-gram model, to compute the entropy of a sample text corpus (the written source) by capturing the dependencies among sequences of letters generated from the corpus. Entropy measures the average information in bits per letter or per block of letters, depending on the N-gram order considered. Entropy also indicates the achievable lower bound for lossless compression schemes such as Huffman coding. When the language is modeled as a first-order Markov chain (i.e., as a memoryless source whose letters occur independently of each other), its entropy is 4.31 bits/letter; compared with ASCII-8, the achievable compression level is about 46%. When N = 19, the estimated entropy is as low as 0.85 bits/letter, which corresponds to a compression level of about 89%. Huffman and arithmetic source coding algorithms are implemented to check the achievable compression level. For the collected sample corpora, the average compression by the Huffman algorithm varies from 42.2% to 64.9% for N = 1 to 5. These compression levels are close to the theoretical entropy bounds. With the increasing use of the language in telecom services and storage systems, these entropy results show the need to further investigate language-specific applications, such as compression algorithms.
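To make the method concrete, the sketch below estimates the conditional N-gram entropy of a text and the implied compression level relative to ASCII-8 (8 bits/letter). It is a minimal sketch under stated assumptions, not the paper's implementation: the stand-in text string and the "corpus.txt" filename are hypothetical placeholders, and the paper's preprocessing of the Afan Oromo corpus (letter set, case handling) is not reproduced here.

```python
from collections import Counter
from math import log2

def ngram_conditional_entropy(text: str, n: int) -> float:
    """Estimate H(X_n | X_1 ... X_{n-1}) in bits/letter from the
    empirical n-gram frequencies of a sample text."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    if n == 1:
        # N = 1: letters treated as independent draws (memoryless source).
        return -sum(c / total * log2(c / total) for c in ngrams.values())
    # Derive context counts from the n-gram counts so that the
    # conditional probabilities sum to 1 within each context.
    contexts = Counter()
    for gram, c in ngrams.items():
        contexts[gram[:-1]] += c
    h = 0.0
    for gram, c in ngrams.items():
        p_joint = c / total               # P(context, letter)
        p_cond = c / contexts[gram[:-1]]  # P(letter | context)
        h -= p_joint * log2(p_cond)
    return h

# Illustrative only: a real run would load the Afan Oromo corpus, e.g.
# text = open("corpus.txt", encoding="utf-8").read()  # hypothetical filename
text = "dammee dammee"  # stand-in string, not the paper's corpus
for n in (1, 2, 3):
    h = ngram_conditional_entropy(text, n)
    # Compression level relative to ASCII-8, as quoted in the abstract:
    print(f"N={n}: H = {h:.2f} bits/letter, "
          f"compression vs ASCII-8 = {100 * (1 - h / 8):.1f}%")
```

The compression level is computed as 1 − H/8; with the abstract's first-order estimate of H = 4.31 bits/letter this gives 1 − 4.31/8 ≈ 0.46, matching the quoted 46% figure.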
Keywords
Compression, Entropy, Encoding, Written Afan Oromo