Improved word vector space with ensemble deep learning model for language identification

Sādhanā (2024)

Abstract
Language identification is the task of determining the language in which a document is written. This work presents word-level language identification of text as English or Hindi. Experiments are performed on a dataset collected from Twitter. First, the collected data is preprocessed using natural language processing techniques. An ensemble word embedding technique is then proposed that combines four word embedding techniques: (i) Word2Vec, (ii) Embeddings from Language Models (ELMo), (iii) Global Vectors (GloVe), and (iv) FastText. The proposed embedding approach is applied to the preprocessed data to obtain an enhanced word vector space for language identification. Finally, text is classified as Hindi or English by four heterogeneous deep learning models: (i) a Convolutional Neural Network (CNN), (ii) Long Short-Term Memory (LSTM), (iii) a hybrid model of CNN and LSTM, and (iv) a hybrid model of Bidirectional Long Short-Term Memory and Gated Recurrent Unit. The proposed hybrid model gives the highest 96.05
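The abstract does not state how the four embeddings are combined. As a minimal sketch, assuming the ensemble is a simple concatenation of each word's vectors (one common choice, not necessarily the authors' method), the idea could look like the following. The function name `ensemble_embedding`, the toy lookup tables, and the zero-fill rule for missing words are all hypothetical, and ELMo is treated here as a static lookup for simplicity even though it is normally contextual.

```python
from typing import Dict, List

Vector = List[float]


def ensemble_embedding(
    word: str, tables: List[Dict[str, Vector]], dims: List[int]
) -> Vector:
    """Concatenate the word's vector from each embedding table.

    Assumption: a word missing from a table is represented by zeros
    of that table's dimensionality.
    """
    combined: Vector = []
    for table, dim in zip(tables, dims):
        combined.extend(table.get(word, [0.0] * dim))
    return combined


# Toy stand-ins for the four embedding sources (values are made up).
word2vec = {"hello": [0.1, 0.2], "namaste": [0.9, 0.8]}
elmo = {"hello": [0.3], "namaste": [0.7]}
glove = {"hello": [0.5, 0.4]}  # "namaste" deliberately missing
fasttext = {"hello": [0.6], "namaste": [0.2]}

tables = [word2vec, elmo, glove, fasttext]
dims = [2, 1, 2, 1]

vec_hello = ensemble_embedding("hello", tables, dims)
vec_namaste = ensemble_embedding("namaste", tables, dims)
```

Under this assumption, every word maps to a vector whose length is the sum of the individual embedding sizes, and that combined vector would be the input to the downstream classifiers.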
Keywords
Language identification, deep learning, word embedding