exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models

ACL, pp. 187-196, 2020

Abstract

Large language models can produce powerful contextual representations that lead to improvements across many NLP tasks. Since these models are typically guided by a sequence of learned self-attention mechanisms and may comprise undesired inductive biases, it is paramount to be able to explore what the attention has learned. While static ...

Introduction
  • Neural networks based on the Transformer architecture have led to impressive improvements across many Natural Language Processing (NLP) tasks such as machine translation and text summarization [Vaswani et al., 2017].
  • Similar to the static analysis by Clark et al. [2019], EXBERT provides insights into both the attention and the token embeddings for the user-defined model and corpus by probing whether the representations capture metadata such as linguistic features or positional information.
  • The most similar token embeddings, defined by a nearest neighbor search, can be viewed in their corpus’ context in a language that humans can understand.
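
The embedding probing described in these points can be approximated outside the tool. Below is a minimal sketch, not exBERT's actual implementation, of extracting layer-wise token embeddings from BERT with the HuggingFace transformers library and finding nearest neighbors by cosine similarity; the function names, the chosen layer, and the brute-force search are illustrative assumptions.

```python
# Minimal sketch of the embedding-probing idea: pull a token's contextual
# embedding from a chosen BERT layer and compare it against a reference matrix
# of embeddings. Names and defaults here are illustrative, not exBERT's code.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def token_embeddings(text, layer=6):
    """Return (tokens, embeddings) for one sentence at a chosen layer."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    hidden = out.hidden_states[layer][0]  # shape: (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return tokens, hidden.numpy()

def nearest_neighbors(query, reference, k=5):
    """Indices of the k reference embeddings most similar to `query` (cosine)."""
    q = query / np.linalg.norm(query)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return np.argsort(-(r @ q))[:k]
```

In the tool itself this search presumably runs over a prebuilt similarity index of the kind described by Johnson et al. [2019]; the brute-force cosine comparison above is only a stand-in to make the probing idea concrete.
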
Highlights
  • Neural networks based on the Transformer architecture have led to impressive improvements across many Natural Language Processing (NLP) tasks such as machine translation and text summarization [Vaswani et al., 2017]
  • The Transformer is based on subsequent application of “multi-head attention” to route the model reasoning, and this technique’s flexibility allows it to be pretrained on large corpora to generate contextual representations that can be used for other tasks
  • Some research has focused on understanding whether BERT learns linguistic features such as Part Of Speech (POS), Dependency relationships (DEP), or Named Entity Recognition (NER) [e.g., Tenney et al., 2019a, Vig and Belinkov, 2019, Raganato and Tiedemann, 2018, Tenney et al., 2019b]
  • We have introduced an interactive visualization, EXBERT, that uses linguistic annotations, interactive masking, and nearest neighbor search to help reveal an intelligible structure in the learned representations of transformer models
  • The source code and demo are available at www.exbert.net, providing the community an opportunity to rapidly experiment with learned Transformer representations and gain a better understanding of what these models learn
Results
  • The Summary View shows histogram summaries of the matched metadata, which is useful for getting a snapshot of the metadata an embedding encodes in the searched corpus.
  • Inspired by Strobelt et al. [2017, 2018], EXBERT performs a nearest neighbor search of embeddings on a reference corpus that is processed with linguistic features as follows (a code sketch of this workflow appears after this list).
  • The reference corpus used is the Wizard of Oz, which is annotated and processed by BERT to allow for nearest neighbor searching.
  • Special tokens like “[CLS]” and “[SEP]” have no linguistic features assigned to them, and are removed from the reference corpus, which allows searches to always match a token that has intuitive meaning for users.
  • The authors explore the layers and heads at which BERT learns the linguistic features of a masked token.
  • Searching by the masked token’s embedding helps show the information that is captured within the token itself, but it is useful to understand how the heads of the previous layer contributed to that information being encoded in the embedding.
  • A search by head embedding at that point reveals that BERT has already learned to attend to sentence structures where the most similar tokens in the corpus are verbs (2c).
  • This is useful, but it is not clear how all the heads were able to maximize their attention on the “din” token and detect the DOBJ pattern that was in 18 of the top 50 matches in the search.
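
The corpus-search workflow described in the list above (annotated reference tokens, special tokens removed, metadata of the top-k matches summarized as a histogram) can be sketched as follows. This is not exBERT's code: `ref_embeddings`, `ref_tokens`, and `ref_tags` are assumed to be precomputed from a corpus annotated with POS/DEP labels, such as the Wizard of Oz corpus mentioned above, and the function names are illustrative.

```python
# Illustrative sketch of the Summary View idea: drop special tokens from the
# reference corpus, find the nearest neighbors of a query embedding, and
# histogram the linguistic metadata of the matches.
from collections import Counter
import numpy as np

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def filter_special(ref_embeddings, ref_tokens, ref_tags):
    """Remove special tokens so every searchable entry carries linguistic meaning."""
    keep = [i for i, tok in enumerate(ref_tokens) if tok not in SPECIAL_TOKENS]
    return (ref_embeddings[keep],
            [ref_tokens[i] for i in keep],
            [ref_tags[i] for i in keep])

def metadata_histogram(query, ref_embeddings, ref_tags, k=50):
    """Counter over the metadata (e.g. DEP labels) of the k nearest reference tokens."""
    q = query / np.linalg.norm(query)
    r = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    top = np.argsort(-(r @ q))[:k]
    return Counter(ref_tags[i] for i in top)

# A result such as Counter({"dobj": 18, "nsubj": 9, ...}) corresponds to the
# kind of DOBJ pattern reported above in 18 of the top 50 matches.
```
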
Conclusion
  • The DEP summary at the bottom of Figure 3d shows that not only does the head match the POS of the seed token, but it has also learned to look for cases where the word following a preposition is possessive.
  • The authors have introduced an interactive visualization, EXBERT, that uses linguistic annotations, interactive masking, and nearest neighbor search to help reveal an intelligible structure in the learned representations of transformer models.
  • The source code and demo are available at www.exbert.net, providing the community an opportunity to rapidly experiment with learned Transformer representations and gain a better understanding of what these models learn.
References
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018.
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
  • Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=SJzSgnRcKX.
  • Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. CoRR, abs/1906.04284, 2019. URL http://arxiv.org/abs/1906.04284.
  • Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In EMNLP Workshop BlackboxNLP, 2018. URL https://www.aclweb.org/anthology/W18-5431.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. CoRR, abs/1905.05950, 2019b. URL http://arxiv.org/abs/1905.05950.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. CoRR, abs/1906.04341, 2019. URL http://arxiv.org/abs/1906.04341.
  • Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. CoRR, abs/1905.09418, 2019. URL http://arxiv.org/abs/1905.09418.
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Transformers: State-of-the-art natural language processing, 2019.
  • Jesse Vig. A multiscale visualization of attention in the transformer model. CoRR, abs/1906.05714, 2019. URL http://arxiv.org/abs/1906.05714.
  • Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, and Roger Wattenhofer. On the validity of self-attention as explanation in transformer models. 2019. URL https://arxiv.org/abs/1908.04211.
  • Sarthak Jain and Byron C. Wallace. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, 2019.
  • Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. 2019. URL https://arxiv.org/abs/1908.04626.
  • Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, and Alexander M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2017.
  • Hendrik Strobelt, Sebastian Gehrmann, Michael Behrisch, Adam Perer, Hanspeter Pfister, and Alexander M. Rush. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. CoRR, abs/1804.09299, 2018. URL http://arxiv.org/abs/1804.09299.
  • Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015. URL http://arxiv.org/abs/1508.07909.