ParsBERT: Transformer-based Model for Persian Language Understanding

Abstract

The surge of pre-trained language models has begun a new era in the field of Natural Language Processing (NLP) by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on ...

Introduction
  • Natural language is the tool humans use to communicate with each other, and a vast amount of data is encoded as text using this tool.
  • Word2Vec [1] and GloVe [2] are pre-trained word embedding methods based on Neural Networks (NNs) that investigate the semantic, syntactic, and logical relationships between words in a sequence to provide static word representation vectors based on the training data.
  • While these methods leave the context of the input sequence out of the equation, contextualized word embedding methods such as ELMo [3] provide dynamic word embeddings by taking the context into account (a brief sketch follows this list).
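A minimal sketch of this static-versus-contextual distinction, assuming the torch and transformers packages and the publicly released ParsBERT checkpoint (the checkpoint name below is an assumption, not taken from this summary): a static Word2Vec/GloVe lookup table would return the same vector for the ambiguous word شیر ("lion"/"milk") in both sentences, whereas a contextual encoder produces different vectors.

```python
# Illustrative sketch: contextual embeddings vary with the sentence, unlike a
# static lookup table. The checkpoint name is an assumption for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def token_vectors(sentence):
    """Return the tokens and per-token hidden states for one sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), out.last_hidden_state[0]

tok_a, vec_a = token_vectors("شیر در قفس است")         # "The lion is in the cage"
tok_b, vec_b = token_vectors("او یک لیوان شیر نوشید")  # "He drank a glass of milk"

# Compare the vectors of the shared surface form شیر (falling back to the first
# real token if the tokenizer happens to split it into subwords).
i_a = tok_a.index("شیر") if "شیر" in tok_a else 1
i_b = tok_b.index("شیر") if "شیر" in tok_b else 1
print(torch.cosine_similarity(vec_a[i_a], vec_b[i_b], dim=0).item())
```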
Highlights
  • Natural language is the tool humans use to communicate with each other
  • It can be seen that ParsBERT achieves significantly higher F1 scores for both multi-class and binary sentiment analysis compared to methods mentioned in DeepSentiPers [38]
  • It can be seen that ParsBERT achieves better accuracy and F1 scores than the multilingual Bidirectional Encoder Representations from Transformers (BERT) model on both the Digikala Magazine and Persian news datasets
  • Results obtained for the Named Entity Recognition (NER) task indicate that ParsBERT outperforms all prior work in this area, achieving F1 scores as high as 93.10 and 98.79 on the PEYMA and ARMAN datasets, respectively
  • ParsBERT is a new model that is lighter than multilingual BERT and achieves state-of-the-art results in downstream tasks such as Sentiment Analysis, Text Classification, and Named Entity Recognition
  • ParsBERT has been made publicly available through the Hugging Face Transformers library and serves as a new baseline for numerous Persian Natural Language Processing (NLP) use cases (see the usage sketch below)
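Since the model is distributed through Hugging Face Transformers, a short usage sketch of the released checkpoint is shown below. The model identifier is an assumption based on the public release, not something stated in this summary.

```python
# Minimal usage sketch: query the released masked language model via the
# fill-mask pipeline. The checkpoint name is assumed for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# Ask the model to complete a Persian sentence at the [MASK] position.
for prediction in fill_mask("ما در هوش مصنوعی [MASK] می کنیم"):
    print(prediction["token_str"], round(prediction["score"], 3))
```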
Results
  • Table 4 shows the results obtained on the Digikala and SnappFood datasets.
  • The authors show that ParsBERT outperforms the multilingual BERT model in terms of accuracy and F1 score.
  • It can be seen that ParsBERT achieves better accuracy and F1 scores than the multilingual BERT model on both the Digikala Magazine and Persian news datasets.
  • Results obtained for the NER task indicate that ParsBERT outperforms all prior work in this area, achieving F1 scores as high as 93.10 and 98.79 on the PEYMA and ARMAN datasets, respectively.
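The reported numbers come from fine-tuning the pre-trained model on each downstream dataset. The sketch below shows what such a fine-tuning loop can look like for binary sentiment classification; the checkpoint name and the two toy comments are assumptions for illustration, and this is not the authors' training code.

```python
# Minimal, illustrative fine-tuning sketch: ParsBERT as an encoder for binary
# sentiment classification (toy data stands in for Digikala/SnappFood comments).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["کیفیت عالی بود", "اصلا راضی نبودم"]   # toy positive / negative comments
labels = torch.tensor([1, 0])                    # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    optimizer.zero_grad()
    out = model(**batch, labels=labels)  # cross-entropy loss computed internally
    out.loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {out.loss.item():.4f}")
```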
Conclusion
  • ParsBERT successfully achieves state-of-the-art performance on all mentioned downstream tasks
  • This result supports the claim that dedicated monolingual language models outperform multilingual ones.
  • The range of topics and written styles included in the pre-training dataset is much more diverse than that of multilingual BERT, which is pre-trained only on Wikipedia
  • Another limitation of the multilingual model, caused by its reliance on the comparatively small Wikipedia corpus, is a vocabulary of 70K tokens shared across all 100 supported languages (see the tokenization sketch after this list).
  • The authors have made ParsBERT available through the Hugging Face Transformers library for public use, and it serves as a new baseline for numerous Persian NLP use cases
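The vocabulary point above can be illustrated by comparing how a multilingual tokenizer and a Persian-specific tokenizer segment the same sentence. The sketch below assumes the transformers package; the ParsBERT checkpoint name is an assumption, and the exact splits depend on the released vocabularies.

```python
# Compare subword segmentation of a Persian phrase under a shared multilingual
# vocabulary versus a monolingual Persian vocabulary.
from transformers import AutoTokenizer

sentence = "پردازش زبان طبیعی"  # "natural language processing"

multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
pars = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")  # assumed

print("mBERT   :", multi.tokenize(sentence))   # typically split into more pieces
print("ParsBERT:", pars.tokenize(sentence))
print("vocab sizes:", multi.vocab_size, pars.vocab_size)
```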
Tables
  • Table 1: Statistics and types of each source in the proposed corpus, covering a varied range of written styles
  • Table 2: Statistics of the pre-training corpus
  • Table 3: Example of the segmentation process: (1) unsegmented sentence, (2) sentence segmented using the WordPiece method
  • Table 4: ParsBERT performance on the Digikala and SnappFood datasets compared to the multilingual BERT model
  • Table 5: ParsBERT performance on the DeepSentiPers dataset compared to methods mentioned in DeepSentiPers [38]
  • Table 6: ParsBERT performance on the text classification task compared to the multilingual BERT model
  • Table 7: ParsBERT performance on the PEYMA and ARMAN datasets for the NER task compared to prior works
Related work
  • 2.1 Language Modelling

    Language modeling has gained popularity in recent years, and many works have been dedicated to building models for different languages and contexts. Some works have sought to build character-level models. For example, a character-level model based on a Recurrent Neural Network (RNN) is presented in [11]; this model reasons about word spelling and grammar dynamically. Another multi-task character-level attentional network model for medical concept normalization has been used to address the Out-Of-Vocabulary (OOV) problem and to preserve morphological information inside the concept [12].

    Contextualized language modeling is centered around the idea that words can be represented differently based on the context in which they appear. Encoder-decoder language models, sequence autoencoders, and sequence-to-sequence models build on this concept [13, 14, 15]. ELMo and ULMFiT [16] are contextualized language models pre-trained on large general-domain corpora. Both are based on LSTM networks [17]: ULMFiT uses a regular multi-layer LSTM network, while ELMo utilizes a bidirectional LSTM structure to predict both the next and previous words in a sequence. It then composes the final embedding for each token by concatenating the left-to-right and right-to-left representations. Both ULMFiT and ELMo show considerable improvement in downstream tasks compared to preceding language models and word embedding methods.
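As a small illustration of the bidirectional idea described above (a generic sketch, not ELMo's actual architecture), a bidirectional LSTM reads the token sequence in both directions and concatenates the forward and backward hidden states at every position:

```python
# Generic bidirectional LSTM sketch: each output position holds the
# concatenation of the left-to-right and right-to-left hidden states.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))  # one toy sequence of 7 tokens
outputs, _ = bilstm(embed(token_ids))
print(outputs.shape)  # (1, 7, 2 * hidden_dim): [forward ; backward] per token
```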
Funding
  • A Masked Language Model (MLM) objective is employed to train the model to predict randomly masked tokens using a cross-entropy loss. For this purpose, given N tokens, 15% of them are selected at random; of these selected tokens, 80% are replaced by the special [MASK] token, 10% are replaced with a random token, and 10% remain unchanged.
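A minimal sketch of this masking scheme (not the authors' code) is shown below. The token ids and the [MASK] id are made up for illustration, and -100 is used as the standard PyTorch ignore index so that the cross-entropy loss is computed only on the selected 15% of tokens.

```python
# Sketch of MLM corruption: select 15% of tokens; of those, 80% -> [MASK],
# 10% -> random token, 10% left unchanged. Labels are kept only for selected tokens.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob   # choose ~15% at random
    labels[~selected] = -100                             # ignore loss on the rest

    roll = torch.rand(input_ids.shape)
    replace_mask = selected & (roll < 0.8)                     # 80% -> [MASK]
    replace_random = selected & (roll >= 0.8) & (roll < 0.9)   # 10% -> random token
    # the remaining 10% of selected tokens stay unchanged

    corrupted = input_ids.clone()
    corrupted[replace_mask] = mask_token_id
    corrupted[replace_random] = torch.randint(vocab_size, (int(replace_random.sum()),))
    return corrupted, labels

# Toy usage with made-up ids: vocabulary of 100 tokens, [MASK] id 99.
ids = torch.randint(0, 99, (1, 12))
corrupted, labels = mask_tokens(ids, mask_token_id=99, vocab_size=100)
print(corrupted, labels)
```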
Study subjects and analysis
sentiment datasets: 3
It aims to classify text, such as user comments, based on emotional bias. The proposed model is evaluated on three sentiment datasets, as follows: 1. Digikala user comments provided by the Open Data Mining Program (ODMP)

sentiment datasets: 3
We extracted it using our tools to provide a more comprehensive evaluation. Figure 3 illustrates the class distribution for all three sentiment datasets. Baselines: Since no work has been done regarding the Digikala and SnappFood datasets, our baseline for these datasets is the multilingual BERT model

articles: 8515
The datasets used for this task come from two sources: 1. A total of 8,515 articles scraped from the Digikala online magazine. This dataset includes seven different classes.

Reference
  • [1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. arXiv, abs/1310.4546, 2013.
  • [2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
  • [3] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv, abs/1802.05365, 2018.
  • [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv, abs/1810.04805, 2019.
  • [5] Alec Radford. Improving language understanding by generative pre-training. OpenAI, 2018.
  • [6] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019.
  • [7] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
  • [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv, abs/1910.10683, 2019.
  • [9] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, F. Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv, abs/1911.02116, 2019.
  • [10] Weihua Wang, Feilong Bao, and Guanglai Gao. Learning morpheme representation for Mongolian named entity recognition. Neural Processing Letters, pages 1–18, 2019.
  • [11] Gengshi Huang and Haifeng Hu. C-RNN: A fine-grained language model for image captioning. Neural Processing Letters, 49:683–691, 2018.
  • [12] Jinghao Niu, Yehui Yang, Siheng Zhang, Zhengya Sun, and Wensheng Zhang. Multi-task character-level attentional networks for medical concept normalization. Neural Processing Letters, 49:1239–1256, 2018.
  • [13] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. arXiv, abs/1511.01432, 2015.
  • [14] Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv, abs/1611.02683, 2016.
  • [15] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv, abs/1409.3215, 2014.
  • [16] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
  • [17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
  • [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv, abs/1706.03762, 2017.
  • [19] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv, abs/1901.07291, 2019.
  • [20] Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv, abs/1909.11942, 2020.
  • [21] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 2018.
  • [22] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. arXiv, abs/1806.03822, 2018.
  • [23] Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. BERTje: A Dutch BERT model. arXiv, abs/1912.09582, 2019.
  • [24] Marco Polignano, Pierpaolo Basile, Marco Degemmis, Giovanni Semeraro, and Valerio Basile. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In CLiC-it, 2019.
  • [25] Wissam Antoun, Fady Baly, and Hazem M. Hajj. AraBERT: Transformer-based model for Arabic language understanding. arXiv, abs/2003.00104, 2020.
  • [26] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv, abs/1912.07076, 2019.
  • [27] Yuri Kuratov and Mikhail Arkhipov. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv, abs/1905.07213, 2019.
  • [28] Fabio Barbosa de Souza, Rodrigo Nogueira, and Roberto de Alencar Lotufo. Portuguese named entity recognition using BERT-CRF. arXiv, abs/1909.10649, 2019.
  • [29] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. arXiv, abs/1802.06893, 2018.
  • [30] Mohammad Sadegh Zahedi, Mohammad Hadi Bokaei, Farzaneh Shoeleh, Mohammad Mehdi Yadollahi, Ehsan Doostmohammadi, and Mojgan Farhoodi. Persian word embedding evaluation benchmarks. In Iranian Conference on Electrical Engineering (ICEE), pages 1583–1588, 2018.
  • [31] Seyed Habib Hosseini Saravani, Mohammad Bahrani, Hadi Veisi, and Sara Besharati. Persian language modeling using recurrent neural networks. In 9th International Symposium on Telecommunications (IST), pages 207–210, 2018.
  • [32] Farid Ahmadi and Hamed Moradi. A hybrid method for Persian named entity recognition. In 7th Conference on Information and Knowledge Technology (IKT), pages 1–7, 2015.
  • [33] Kia Dashtipour, Mandar Gogate, Ahsan Adeel, Abdulrahman Algarafi, Newton Howard, and Amir Hussain. Persian named entity recognition. In IEEE 16th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pages 79–83, 2017.
  • [34] Mohammad Hadi Bokaei and Maryam Mahmoudi. Improved deep Persian named entity recognition. In 9th International Symposium on Telecommunications (IST), pages 381–386, 2018.
  • [35] Ehsan Taher, Seyed Abbas Hoseini, and Mehrnoush Shamsfard. Beheshti-NER: Persian named entity recognition using BERT. arXiv, abs/2003.08875, 2020.
  • [36] Mohammad Bagher Dastgheib, Sara Koleini, and Farzad Rasti. The application of deep learning in Persian documents sentiment analysis. International Journal of Information Science and Management, 18:1–15, 2020.
  • [37] Kayvan Bijari, Hadi Zare, Emad Kebriaei, and Hadi Veisi. Leveraging deep graph-based text representation for sentiment polarity applications. Expert Systems with Applications, 144:113090, 2020.
  • [38] Javad PourMostafa Roshan Sharami, Parsa Abbasi Sarabestani, and Seyed Abolghasem Mirroshandel. DeepSentiPers: Novel deep learning models trained over proposed augmented Persian sentiment corpus. arXiv, abs/2004.05328, 2020.
  • [39] Pedram Hosseini, Ali Ahmadian Ramaki, Hassan Maleki, Mansoureh Anvari, and Seyed Abolghasem Mirroshandel. SentiPers: A sentiment analysis corpus for Persian. arXiv, abs/1801.07737, 2018.
  • [40] Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In LREC, 2012.
  • [41] Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In CMLC-7, 2019.
  • [42] Behnam Sabeti, Hossein Abedi Firouzjaee, Ali Janalizadeh Choobbasti, S. H. E. Mortazavi Najafabadi, and Amir Vaheb. MirasText: An automatically generated text corpus for Persian. In LREC, 2018.
  • [43] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, abs/1412.6980, 2015.
  • [44] Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL, 2018.
  • [45] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv, abs/1508.07909, 2016.
  • [46] Mahsa Sadat Shahshahani, Mahdi Mohseni, Azadeh Shakery, and Heshaam Faili. PEYMA: A tagged corpus for Persian named entities. arXiv, abs/1801.09936, 2018.
  • [47] Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi. BiLSTM-CRF for Persian named-entity recognition; ArmanPersoNERCorpus: the first entity-annotated Persian dataset. In LREC, 2018.
  • [48] Nasrin Taghizadeh, Zeinab Borhani-fard, Melika GolestaniPour, and Heshaam Faili. NSURL-2019 Task 7: Named entity recognition (NER) in Farsi. arXiv, abs/2003.09029, 2020.
  • [49] Leila Hafezi and Mehdi Rezaeian. Neural architecture for Persian named entity recognition. In 4th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS), pages 61–64, 2018.
  • [50] Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, and Massimo Piccardi. PersoNER: Persian named-entity recognition. In COLING, 2016.
Author
Mehrdad Farahani
Mohammad Gharachorloo
Marzieh Farahani
Mohammad Manthouri