To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging
EMNLP 2020, pp. 7927–7934.
Leveraging large amounts of unlabeled data using Transformer-like architectures, like BERT, has gained popularity in recent times owing to their effectiveness in learning general representations that can then be further fine-tuned for downstream tasks to much success. However, training these models can be costly both from an economic and …
- Exploiting unlabeled data to improve performance has become the foundation for many natural language processing tasks.
- The question the authors investigate in this paper is how to effectively use unlabeled data: in a task-agnostic or a task-specific way.
- An example of the former is training models on language-model (LM) objectives over a large unlabeled corpus to learn general representations, as in ELMo (Embeddings from Language Models) (Peters et al., 2018) and BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019).
- Cross-View Training (CVT) (Clark et al, 2018) is a semi-supervised approach that uses unlabeled data in a task-specific manner, rather than trying to learn general representations that can be used for many downstream tasks
- We focus on three tasks: opinion target expression (OTE) detection, named entity recognition (NER), and slot-labeling, each of which can be modeled as a sequence tagging problem (Xu et al., 2018; Liu et al., 2019a; Louvan and Magnini, 2018)
- We present a metrics-based and a resource-based comparison of CVT and BERT models on all tasks
- We compare the task-specific semi-supervised method, CVT, with a task-agnostic semi-supervised approach, BERT, on a variety of problems that can be modeled as sequence tagging tasks
- We find that the CVT-based approach is more robust than BERT-based models across tasks and types of unsupervised data available to them
- Slot-labeling: Slot-labeling is a key component of Natural Language Understanding (NLU) in dialogue systems, which involves labeling words of an utterance with pre-defined attributes - slots
- For this task, the authors use the widely-used MIT-Movie dataset as labeled data, which contains queries related to movie information, with 12 slot labels such as Plot, Actor, Director, etc.
- For SemEval-2016 Restaurants, the authors find the mean F1 of the APBERTBase model to be comparable to that of CVT (p-value 0.26)
- Both models outperform the SOTA baseline.
- For SemEval-2014 Laptops, APBERTBase is found to have a statistically significantly higher F1 than CVT (p-value 0.04), and both models outperform SOTA
- The authors compare the task-specific semi-supervised method, CVT, with a task-agnostic semi-supervised approach, BERT, on a variety of problems that can be modeled as sequence tagging tasks.
- The authors find that the CVT-based approach is more robust than BERT-based models across tasks and types of unsupervised data available to them.
- The financial and environmental costs incurred are significantly lower using CVT as compared to BERT.
- The authors intend to implement CVT as a training strategy over Transformers (BERT) and compare it with Adaptively Pretrained BERT
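The CVT objective described above (Clark et al., 2018) trains auxiliary modules, which see restricted views of each input (e.g. only the forward context), to match the predictions of a primary module that sees the full input on unlabeled data. A conceptual sketch of that consistency loss; the function names, the KL formulation, and the toy distributions are illustrative, not the authors' implementation:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete tag distributions (assumed strictly positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cvt_consistency_loss(primary_probs, auxiliary_probs):
    """Mean per-token divergence between the primary module's predictions
    (full view, treated as a fixed teacher on unlabeled data) and one
    auxiliary module's predictions (restricted view)."""
    losses = [kl_divergence(p, q) for p, q in zip(primary_probs, auxiliary_probs)]
    return sum(losses) / len(losses)

# Toy example: 2 tokens, 3 tags (e.g. B, I, O).
primary = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # full-view predictions
aux_fwd = [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]   # forward-only view
loss = cvt_consistency_loss(primary, aux_fwd)
```

Minimizing this loss pushes the restricted-view modules toward the full-view predictions, which in turn regularizes the shared encoder.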
- Table 1: Number of sentences in unlabeled data and default train-test splits of the labeled datasets, for the various tasks.
- Table 2: Model performance for the OTE detection task. The same unlabeled dataset is used for training CVT, Pre-BERTBase and APBERTBase, and Unlabeled Data indicates the approximate number of sentences seen by each model during training, until the convergence criterion is met. Wiki+Books and Amazon-L refer to English cased Wikipedia and Books Corpus, and Amazon Laptop Reviews, respectively. Xu et al. (2018) propose DE-CNN, the SOTA baseline for the task. They do not specify the sizes of the unlabeled data used.
- Table 3: Model performance for NER. The same unlabeled dataset is used for training CVT, Pre-BERTBase and APBERTBase, and Unlabeled Data indicates the approximate number of sentences seen by each model during training, until the convergence criterion is met. Cloze (Baevski et al., 2019) and BERT-MRC+DSC (Li et al., 2019) are SOTA baselines for CoNLL-2003 and CoNLL-2012, respectively, for this task. Baevski et al. (2019) also use subsampled Common Crawl and News Crawl datasets but do not provide exact splits for these.
- Table 4: Model performance for Slot-labeling. The same unlabeled dataset is used for training CVT, Pre-BERTBase and APBERTBase, and Unlabeled Data indicates the approximate number of sentences seen by each model during training, until the convergence criterion is met. HSCRF + softdict (Louvan and Magnini, 2018) is the SOTA baseline for this task.
- Table 5: Estimated CO2 emissions and computational cost for CVT and BERT models, using models trained on Yelp Restaurants (Yelp-R) as an example. These computations hold for other tasks and datasets discussed in this work. HW (hardware) refers to the number of GPUs/CPUs used. Cost refers to approximate cost in USD. Power stands for total power consumption (in kWh), i.e., combined GPU, CPU and DRAM consumption multiplied by the Power Usage Effectiveness (PUE) coefficient to account for additional energy needed for infrastructure support (Strubell et al., 2019). CO2 represents CO2 emissions in pounds.
- Table 6: Validation Set Metrics for all Models.
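The energy accounting in Table 5 follows Strubell et al. (2019). A hedged sketch of the arithmetic: the default PUE of 1.58 and the 0.954 lbs-CO2/kWh conversion factor are the averages used in that paper, while the wattages and runtime below are purely illustrative, not numbers from Table 5:

```python
def total_power_kwh(gpu_watts, cpu_watts, dram_watts, hours, pue=1.58):
    """Total energy in kWh: combined GPU, CPU and DRAM draw, scaled by the
    Power Usage Effectiveness (PUE) coefficient to cover infrastructure
    overhead (Strubell et al., 2019)."""
    return pue * (gpu_watts + cpu_watts + dram_watts) * hours / 1000.0

def co2_pounds(kwh, lbs_per_kwh=0.954):
    """CO2 emissions in pounds; 0.954 lbs/kWh is the U.S. average
    conversion used by Strubell et al. (2019)."""
    return lbs_per_kwh * kwh

# Illustrative numbers only: one 250 W GPU, 100 W CPU, 50 W DRAM, 10 hours.
energy = total_power_kwh(gpu_watts=250, cpu_watts=100, dram_watts=50, hours=10)
emissions = co2_pounds(energy)
```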
- The usefulness of continued training of large transformer-based models on domain/task-related unlabeled data has been shown recently (Gururangan et al, 2020; Rietzler et al, 2019; Xu et al, 2019), with a varied use of terminology for the process. Xu et al (2019) and Rietzler et al (2019) show gains of further tuning BERT using in-domain unlabeled data and refer to this as Post-training, and LM finetuning, respectively. More recently, Gururangan et al (2020) use the term Domain-Adaptive Pretraining and show benefits over RoBERTa (Liu et al, 2019b). There have also been efforts to reduce model sizes for BERT, such as DistilBERT (Sanh et al, 2019), although these come at significant losses in performance.
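Adaptive pretraining in these works continues BERT's masked-language-model objective on in-domain unlabeled text. A minimal sketch of the corruption scheme; the 15% selection rate and the 80/10/10 split are from Devlin et al. (2019), but the function itself is an illustrative reconstruction, not any library's API:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", seed=0):
    """BERT-style masked-LM corruption (Devlin et al., 2019): 15% of
    positions are selected for prediction; of these, 80% become [MASK],
    10% a random vocabulary token, and 10% stay unchanged. Returns
    (inputs, labels), with labels None at unselected positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < 0.15:
            labels.append(tok)            # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)        # kept intact, but still predicted
        else:
            labels.append(None)           # no loss at this position
            inputs.append(tok)
    return inputs, labels
```

Adaptive pretraining simply runs this objective over the in-domain corpus (e.g. Yelp-R) before task fine-tuning.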
Opinion Target Expression (OTE) Detection: An integral component of fine-grained sentiment analysis is the ability to identify segments of text towards which opinions are expressed. These segments are referred to as Opinion Target Expressions, or OTEs. An example of this task is provided in Figure 1.
Figure 1: (a) OTE detection example; (b) NER example; (c) Slot-labeling example.
The commonly used labeled datasets for OTE detection are those released as part of the SemEval Aspect-based Sentiment shared tasks: SemEval-2014 Laptops (Pontiki et al., 2014) (SE14-L) and SemEval-2016 Restaurants (Pontiki et al., 2016) (SE16-R). These consist of reviews from the laptop and restaurant domains, respectively, with OTEs annotated for each sentence of a review. We use the provided train-test splits but further split the training data randomly into 90% training and 10% validation sets. As unlabeled data that is similar to the domain and task, we extract restaurant reviews from the Yelp dataset (Yelp-R) and reviews of electronics products from the Amazon Product Reviews dataset (Amazon-E) (see Table 1).
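Each of the three tasks reduces to per-token tagging, commonly encoded with the BIO scheme (Ramshaw and Marcus, 1999). A minimal sketch of converting annotated spans to BIO tags; the TARGET label name and the example sentence are illustrative:

```python
def spans_to_bio(tokens, spans, label="TARGET"):
    """Convert token-index spans [(start, end), ...] (end exclusive)
    into one BIO tag per token: B- opens a span, I- continues it,
    O marks tokens outside any span."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["The", "fish", "tacos", "were", "amazing"]
tags = spans_to_bio(tokens, [(1, 3)])
# → ["O", "B-TARGET", "I-TARGET", "O", "O"]
```

The same encoding covers NER (entity-type labels) and slot-labeling (slot names such as Actor or Director) by swapping the label set.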
Study subjects and analysis
In Tables 3 and 4, we present F1 results on the NER and Slot-labeling tasks, respectively. For all three datasets, we find CVT to outperform all BERT models (statistically significant for CoNLL-2003 and the MIT-Movie dataset, at p-values 0.0086 and 0.0085, respectively). For these tasks, BERTBase outperforms APBERTBase models
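One standard way to obtain p-values like these from per-seed F1 scores is a paired permutation test; a sketch follows, where both the choice of test and all the scores are illustrative assumptions, not taken from the paper:

```python
import itertools

def paired_permutation_p(scores_a, scores_b):
    """Exact two-sided paired permutation test on the mean difference of
    two matched score lists (e.g. F1 over random seeds): every sign
    assignment of the per-seed differences is enumerated, and the p-value
    is the fraction at least as extreme as the observed difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme, total = 0, 0
    for signs in itertools.product([1, -1], repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total

# Illustrative F1 scores over 5 seeds for two hypothetical models.
model_a = [92.1, 91.8, 92.4, 92.0, 91.9]
model_b = [91.2, 91.0, 91.5, 91.3, 91.1]
p = paired_permutation_p(model_a, model_b)
```

Exact enumeration is feasible here because the number of runs is small (2^5 assignments); with many runs one would sample sign assignments instead.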
- Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven Pretraining of Self-attention Networks. ArXiv, abs/1903.07785.
- Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM.
- Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. 2018. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. ArXiv, abs/2004.10964.
- S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Statist., 22(1):79–86.
- Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. ArXiv, abs/1910.13461.
- Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2019. Dice Loss for Data-imbalanced NLP Tasks. ArXiv, abs/1911.02855.
- Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin. 2019a. Towards Improving Neural Named Entity Recognition with Gazetteers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5301–5307, Florence, Italy. Association for Computational Linguistics.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.
- Samuel Louvan and Bernardo Magnini. 2018. Exploring Named Entity Recognition As an Auxiliary Task for Slot Filling in Conversational Language Understanding. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 74–80, Brussels, Belgium. Association for Computational Linguistics.
- Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ArXiv, abs/1603.01354.
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
- David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-Training for Parsing. In Proceedings of the main conference on human language technology conference of the North American Chapter of the Association of Computational Linguistics, pages 152–159. Association for Computational Linguistics.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proc. of NAACL.
- Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphee De Clercq, et al. 2016. SemEval-2016 Task 5: Aspect Based Sentiment Analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30.
- Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect Based Sentiment Analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.
- Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea.
- Lance A Ramshaw and Mitchell P Marcus. 1999. Text Chunking using Transformation-based Learning. In Natural language processing using very large corpora, pages 157–176. Springer.
- Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2019. Adapt or Get Left Behind: Domain Adaptation through BERT Language Model Finetuning for Aspect-Target Sentiment Classification.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.
- Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI.
- Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
- Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.
- Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324–2335, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2018. Double Embeddings and CNN-based Sequence Labeling for Aspect Extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 592–598, Melbourne, Australia. Association for Computational Linguistics.
- David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In 33rd annual meeting of the association for computational linguistics.
- https://www.nltk.org/api/nltk.tokenize.html
- https://www.clips.uantwerpen.be/conll2003/ner/bin/conlleval