Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis

EMNLP 2020, pp. 3594–3605.


Abstract:

Aspect-based sentiment analysis (ABSA) aims to predict the sentiment towards a specific aspect in the text. However, existing ABSA test sets cannot be used to probe whether a model can distinguish the sentiment of the target aspect from the non-target aspects. To solve this problem, we develop a simple but effective approach to enrich ABSA test sets.

Introduction
  • Aspect-based sentiment analysis (ABSA) is an advanced sentiment analysis task that aims to classify the sentiment towards a specific aspect.
  • If a model makes the correct sentiment classification for burgers in the original sentence “Tasty burgers, and crispy fries”, it should flip its prediction when seeing the new context “Terrible burgers, but crispy fries”.
  • These questions together form a probe to verify if an ABSA model has high aspect robustness
Highlights
  • Aspect-based sentiment analysis (ABSA) is an advanced sentiment analysis task that aims to classify the sentiment towards a specific aspect
  • We probe the aspect robustness of nine models, and reveal up to 69.73% performance drop on Aspect Robustness Test Set (ARTS) compared with the original test set
  • We propose a novel metric, Aspect Robustness Score (ARS), that counts the correct classification of the source example and all its variations (REVTGT, REVNON, and ADDDIFF) as one unit of correctness
  • We propose a simple but effective mechanism to generate test samples that probe the aspect robustness of models
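The ARS computation described in the highlights can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it assumes each prediction comes with a source id linking a variation back to its source example.

```python
from collections import defaultdict

def aspect_robustness_score(predictions, golds, source_ids):
    """Aspect Robustness Score: a source example together with all its
    variations (REVTGT, REVNON, ADDDIFF) forms one unit, and the unit
    counts as correct only if every member is classified correctly."""
    units = defaultdict(list)
    for pred, gold, sid in zip(predictions, golds, source_ids):
        units[sid].append(pred == gold)
    return sum(all(ok) for ok in units.values()) / len(units)

# Toy data: two source sentences, each grouped with two variations.
preds = ["pos", "neg", "pos", "pos", "neg", "neg"]
golds = ["pos", "neg", "pos", "pos", "pos", "neg"]
ids   = [0, 0, 0, 1, 1, 1]
print(aspect_robustness_score(preds, golds, ids))  # 0.5
```

Plain accuracy on the same toy data would be 5/6, which shows why ARS is the stricter metric.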
Results
  • The authors list the accuracy of the nine models on the Laptop and Restaurant test sets in Table 7.
  • Entire Test set accuracy, Ori → New (Change), on the Laptop dataset:
  • MemNet: 64.42 → 16.93 (↓47.49)
  • GatedCNN: 65.67 → 10.34 (↓55.33)
  • AttLSTM: 67.55 → 09.87 (↓57.68)
  • TD-LSTM: 68.03 → 22.57 (↓45.46)
  • GCN
Conclusion
  • The authors proposed a simple but effective mechanism to generate test samples to probe the aspect robustness of the models.
  • The authors enlarged the original SemEval 2014 test sets to 294% and 315% of their original sizes in the laptop and restaurant domains.
  • Using the new test set, the authors probed the aspect robustness of nine ABSA models, and discussed model designs and training strategies that can improve robustness
Summary
  • Introduction:

    Aspect-based sentiment analysis (ABSA) is an advanced sentiment analysis task that aims to classify the sentiment towards a specific aspect.
  • If a model makes the correct sentiment classification for burgers in the original sentence “Tasty burgers, and crispy fries”, it should flip its prediction when seeing the new context “Terrible burgers, but crispy fries”.
  • These questions together form a probe to verify if an ABSA model has high aspect robustness
  • Objectives:

    The authors aim to build a systematic method to generate all possible aspect-related alternations, in order to remove the confounding factors in the existing ABSA data.
  • The authors aim to generate a new sentence that flips the sentiment of the target aspect, e.g., turning “Tasty burgers” into “Terrible burgers”
  • Results:

    The authors list the accuracy of the nine models on the Laptop and Restaurant test sets in Table 7.
  • Entire Test set accuracy, Ori → New (Change), on the Laptop dataset:
  • MemNet: 64.42 → 16.93 (↓47.49)
  • GatedCNN: 65.67 → 10.34 (↓55.33)
  • AttLSTM: 67.55 → 09.87 (↓57.68)
  • TD-LSTM: 68.03 → 22.57 (↓45.46)
  • GCN
  • Conclusion:

    The authors proposed a simple but effective mechanism to generate test samples to probe the aspect robustness of the models.
  • The authors enlarged the original SemEval 2014 test sets to 294% and 315% of their original sizes in the laptop and restaurant domains.
  • Using the new test set, the authors probed the aspect robustness of nine ABSA models, and discussed model designs and training strategies that can improve robustness
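A rough sketch of the three generation strategies summarized above is given below. All names here are illustrative: the toy `ANTONYMS` table stands in for the paper's opinion-word flipping (which relies on linguistic resources and human checks), and the helper functions are hypothetical.

```python
# Toy antonym table standing in for real opinion-word flipping.
ANTONYMS = {"Tasty": "Terrible", "crispy": "soggy"}

def rev_tgt(sentence, target_opinion):
    """RevTgt: reverse the sentiment of the target aspect."""
    return sentence.replace(target_opinion, ANTONYMS[target_opinion])

def rev_non(sentence, non_target_opinions):
    """RevNon: reverse the sentiments of all non-target aspects."""
    for word in non_target_opinions:
        sentence = sentence.replace(word, ANTONYMS[word])
    return sentence

def add_diff(sentence, clause):
    """AddDiff: append non-target aspects with the opposite sentiment."""
    return sentence.rstrip(".") + ", but " + clause

src = "Tasty burgers, and crispy fries."
print(rev_tgt(src, "Tasty"))     # Terrible burgers, and crispy fries.
print(rev_non(src, ["crispy"]))  # Tasty burgers, and soggy fries.
print(add_diff(src, "the service was dreadful"))
```

A robust model classifying “burgers” should flip its prediction only in the RevTgt case and keep it unchanged in the other two.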
Tables
  • Table1: The generation strategies and examples of the prerequisite (Prereq) and three questions (Q1)-(Q3). Each example is annotated with the target aspect (Tgt) and the altered sentence parts
  • Table2: Three strategies and examples of REVTGT
  • Table3: The generation process of REVNON. The target aspect (Tgt), and sentiments of all aspects are annotated
  • Table4: Example aspect expressions from AspectSet of the restaurant domain
  • Table5: Overall statistics of the ARTS test set and results of fluency and sentiment checks
  • Table6: Characteristics of the ARTS test sets in comparison to the Original (“Ori”) Laptop and Restaurant test sets
  • Table7: Model accuracy on Laptop and Restaurant data. We compare the accuracy on the Original and our New test sets (Ori → New), and calculate the change in accuracy. Besides the Entire Test Set, we also list accuracy on subsets where the generation strategies REVTGT, REVNON and ADDDIFF can be applied. The accuracy of Entire Test-New is calculated using ARS. A marker indicates whether the performance drop is statistically significant (p-value ≤ 0.05 by Welch’s t-test)
  • Table8: The accuracy of each model on the original test set and the new test set generated by REVNON+ADDDIFF in laptop and restaurant domains
  • Table9: Models in the ascending order of their ARS on Laptop. We list their aspect-specific mechanisms, including concatenating the aspect and word embeddings (Asp+W Emb), position-aware mechanism for aspects (Posi-Aware), and attention using the aspect (Asp Att). We highlight for Posi-Aware as it is the most related to aspect robustness for non-BERT models
  • Table10: Improvements on the new test set using different training data
Funding
  • This work was partially funded by China National Key R&D Program (No. 2018YFC0831105, 2018YFB1005104, 2017YFB1002104), National Natural Science Foundation of China (No. 61751201, 61976056, 61532011), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), and Science and Technology Commission of Shanghai Municipality Grant (No. 18DZ1201000, 17JC1420200)
Study subjects and analysis
Laptop and Restaurant datasets: 2014
For example, the Twitter dataset (Dong et al., 2014) has only one aspect per sentence, so the model does not need to distinguish the target aspect from non-target aspects. In the most widely used SemEval 2014 Laptop and Restaurant datasets (Pontiki et al., 2014), the sentiment of the target aspect and the sentiments of all non-target aspects are identical in 83.9% and 79.6% of the test samples, respectively. Hence, we cannot tell whether models that classify these samples correctly attend only to the target aspect; they may instead rely on the non-target aspects, which act as confounding factors
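The 83.9% and 79.6% figures above are simple to recompute given per-aspect sentiment labels. A minimal sketch, assuming a hypothetical representation of each sample as a (target_sentiment, non_target_sentiments) pair:

```python
def confounded_fraction(samples):
    """Fraction of samples where every non-target aspect carries the
    same sentiment as the target aspect, so a model attending to the
    wrong aspect would still be scored as correct."""
    confounded = sum(
        all(s == tgt for s in non_tgt) for tgt, non_tgt in samples
    )
    return confounded / len(samples)

# Toy data; on the real SemEval 2014 test sets this statistic is
# 83.9% (Laptop) and 79.6% (Restaurant) according to the text above.
data = [("pos", ["pos", "pos"]), ("neg", ["pos"]),
        ("neg", ["neg"]), ("pos", [])]
print(confounded_fraction(data))  # 0.75
```

Note that a sentence with no non-target aspects is vacuously confounded in this sketch; whether to count such samples is a design choice.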

samples: 59
Only a small portion of the test set can be used to answer the target questions posed at the beginning. Moreover, when we test on the subset of the test set (59 samples in Laptop, and 122 samples in Restaurant) where the target aspect sentiment differs from all non-target aspect sentiments (so that the confounding factor is disentangled), the best model (Xu et al., 2019a) drops from 78.53% to 59.32% on Laptop and from 86.70% to 63.93% on Restaurant. This implies that the success of pre-

samples: 1877
In this way, we produce an “all-rounded” test set that can test whether a model robustly captures the target sentiment instead of relying on other, irrelevant clues. We enlarged the laptop dataset to 294% of its original size, from 638 to 1,877 samples, and the restaurant dataset to 315%, from 1,120 to 3,530 samples. By human evaluation, more than 92% of the new Aspect Robustness Test Set (ARTS) shows high fluency and the desired sentiment on all aspects

samples: 1877
The resulting Laptop dataset has 2,163 training, 150 validation, and 638 test instances, and Restaurant has 3,452 training, 150 validation, and 1,120 test instances. Building upon the original SemEval 2014 data, we generate enriched test sets of 1,877 samples (294% of the original size) in the laptop domain and 3,530 samples (315%) in the restaurant domain using the generation method introduced in Section 2. The statistics of our ARTS test set are in Table 5
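The enrichment percentages quoted throughout (294% and 315%) follow directly from the sample counts:

```python
# ARTS test-set size relative to the original SemEval 2014 test sets.
orig_size = {"laptop": 638, "restaurant": 1120}
arts_size = {"laptop": 1877, "restaurant": 3530}

for domain in orig_size:
    ratio = round(arts_size[domain] / orig_size[domain] * 100)
    print(f"{domain}: {ratio}% of the original size")
# laptop: 294% of the original size
# restaurant: 315% of the original size
```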

References
  • Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2890–2896. Association for Computational Linguistics.
  • Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. 2014. Adaptive recursive neural network for target-dependent twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49–54, Baltimore, Maryland. Association for Computational Linguistics.
  • Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 31–36. Association for Computational Linguistics.
  • Zhifang Fan, Zhen Wu, Xin-Yu Dai, Shujian Huang, and Jiajun Chen. 2019. Target-oriented opinion words extraction with target-fused neural sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2509–2518. Association for Computational Linguistics.
  • Gayatree Ganu, Noemie Elhadad, and Amelie Marian. 2009. Beyond the stars: Improving rating predictions using review text content. In 12th International Workshop on the Web and Databases, WebDB 2009, Providence, Rhode Island, USA, June 28, 2009.
  • Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 650–655. Association for Computational Linguistics.
  • Ruining He and Julian J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, pages 507–517. ACM.
  • Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, Wei Wei, Wen-Lian Hsu, and Cho-Jui Hsieh. 2019. On the robustness of self-attentive models. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1520–1529. Association for Computational Linguistics.
  • Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.
  • Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6279–6284. Association for Computational Linguistics.
  • Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019a. Is BERT really robust? Natural language attack on text classification and entailment. CoRR, abs/1907.11932.
  • Zhijing Jin, Di Jin, Jonas Mueller, Nicholas Matthews, and Enrico Santus. 2019b. IMaT: Unsupervised text attribute transfer via iterative matching and translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3095–3107. Association for Computational Linguistics.
  • Vidur Joshi, Matthew E. Peters, and Mark Hopkins. 2018. Extending a parser to distant domains using a few dozen partially annotated examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1190–1199. Association for Computational Linguistics.
  • Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif Mohammad. 2014. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23-24, 2014, pages 437–442. The Association for Computer Linguistics.
  • Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR, abs/1612.08220.
  • Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4068–4074. ijcai.org.
  • George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.
  • Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.
  • Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6830–6841.
  • Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016a. Effective LSTMs for target-dependent sentiment classification. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3298–3307. ACL.
  • Duyu Tang, Bing Qin, and Ting Liu. 2016b. Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 214–224. The Association for Computational Linguistics.
  • Duy-Tin Vo and Yue Zhang. 2015. Target-dependent twitter sentiment classification with rich automatic features. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1347–1353. AAAI Press.
  • Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 606–615. The Association for Computational Linguistics.
  • Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019a. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2324–2335. Association for Computational Linguistics.
  • Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019b. A failure of aspect sentiment classifiers and an adaptive re-weighting solution. CoRR, abs/1911.01460.
  • Wei Xue and Tao Li. 2018. Aspect based sentiment analysis with gated convolutional networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2514–2523. Association for Computational Linguistics.
  • Chen Zhang, Qiuchi Li, and Dawei Song. 2019a. Aspect-based sentiment classification with aspect-specific graph convolutional networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4567–4577. Association for Computational Linguistics.
  • Yuan Zhang, Jason Baldridge, and Luheng He. 2019b. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1298–1308. Association for Computational Linguistics.