Modeling Recurrence for Transformer

North American Chapter of the Association for Computational Linguistics (NAACL), 2019: 1198-1207

TL;DR: We propose to directly model recurrence for Transformer with an additional recurrence encoder.
Abstract

Recently, the Transformer model, which is based solely on attention mechanisms, has advanced the state of the art on various machine translation tasks. However, recent studies reveal that the lack of recurrence hinders its further improvement of translation capacity. In response to this problem, we propose to directly model recurrence for Transformer with an additional recurrence encoder.

Introduction
  • Transformer (Vaswani et al., 2017), a network architecture based solely on attention mechanisms, has advanced the state of the art on various translation tasks across language pairs. Compared with conventional recurrent neural network (RNN) (Schuster and Paliwal, 1997) based models, which leverage recurrence as the basic building module (Sutskever et al., 2014; Bahdanau et al., 2015; Chen et al., 2018), Transformer replaces the RNN with a self-attention network (SAN) to model the dependencies among input elements.
  • One appealing strength of SAN is that it breaks down the sequential assumption to obtain the ability of highly parallel computation: input elements interact with each other simultaneously without regard to their distance. However, prior studies empirically show that the lack of recurrence modeling hinders Transformer from further improvement of translation quality (Dehghani et al., 2019).
  • Modeling recurrence is crucial for capturing several essential properties of the input sequence, such as structural representations (Tran et al., 2016) and positional encoding (Shaw et al., 2018), which are exactly the weaknesses of SAN (Tran et al., 2018).

    ∗ Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Jie Hao and Baosong Yang were interning at Tencent AI Lab.
Highlights
  • Chen et al. (2018) show that the representations learned by SAN-based and RNN-based encoders are complementary to each other, and merging them can improve translation performance for RNN-based NMT models. Starting from these findings, we propose to directly model recurrence for Transformer with an additional recurrence encoder.
  • In addition to the standard RNN, we propose to implement the additional recurrence encoder with a novel attentive recurrent network (ARN), which combines the advantages of both SAN and RNN.
  • We present a short-cut connection between the recurrence encoder and the decoder, which we found very effective for using the learned representations to improve translation performance under the proposed architecture.
  • To effectively feed the recurrence representations to the decoder to guide the output sequence generation, we study two strategies to integrate the recurrence encoder into the Transformer (a rough sketch of this two-encoder setup follows this list).
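The bullets above describe a two-encoder architecture: the standard SAN encoder plus an additional recurrence encoder whose output reaches the decoder through a short-cut connection. Below is a minimal PyTorch-style sketch of that idea; the bidirectional GRU standing in for the recurrence module, the layer sizes, and the concatenation-plus-linear fusion are illustrative assumptions, not the paper's exact RNN/ARN implementation or its two integration strategies.

```python
# Minimal sketch of a SAN encoder plus an additional recurrence encoder whose
# output is exposed to the decoder via a short-cut connection (assumed fusion).
import torch
import torch.nn as nn


class RecurrenceEncoder(nn.Module):
    """Produces recurrence representations of the source embeddings."""

    def __init__(self, d_model: int = 512, num_layers: int = 2):
        super().__init__()
        self.rnn = nn.GRU(
            input_size=d_model,
            hidden_size=d_model // 2,  # bidirectional outputs concatenate to d_model
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, src_emb: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(src_emb)     # (batch, src_len, d_model)
        return self.norm(out)


class TwoEncoderModel(nn.Module):
    """SAN encoder + recurrence encoder; both feed the Transformer decoder."""

    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        san_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.san_encoder = nn.TransformerEncoder(san_layer, num_layers)
        self.rec_encoder = RecurrenceEncoder(d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Short-cut fusion of the two source representations (assumption).
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> torch.Tensor:
        san_out = self.san_encoder(src_emb)
        rec_out = self.rec_encoder(src_emb)
        memory = self.fuse(torch.cat([san_out, rec_out], dim=-1))
        return self.decoder(tgt_emb, memory)


# Toy usage with random embeddings (a real system would add masks, positional
# encodings, and embedding/output layers).
model = TwoEncoderModel()
src = torch.randn(2, 7, 512)   # (batch, src_len, d_model)
tgt = torch.randn(2, 5, 512)   # (batch, tgt_len, d_model)
print(model(src, tgt).shape)   # torch.Size([2, 5, 512])
```

In the paper itself, the recurrence module is either a standard RNN or the proposed ARN, and the recurrence representations are fed to the decoder via one of the two studied integration strategies rather than the simple linear fusion used in this sketch.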
Results
  • Performance across languages: the authors evaluated the proposed approach on the widely used WMT17 Zh⇒En and WMT14 En⇒De data, as listed in Table 3. To make the evaluation convincing, they reviewed the prior reported systems and built strong systems of their own.
  • Zh⇒En rows excerpted from Table 3 (System, # Para., BLEU): Transformer-Big (existing system): n/a, 24.2; RNMT+ with SAN encoder (existing system): n/a, n/a; Transformer-Base (the authors' NMT system): 107.9M, 24.13.
Conclusion
  • The authors propose to directly model recurrence for Transformer with an additional recurrence encoder.
  • The recurrence encoder is used to generate recurrence representations for the input sequence.
  • To effectively feed the recurrence representations to the decoder to guide the output sequence generation, the authors study two strategies to integrate the recurrence encoder into the Transformer.
  • To evaluate the effectiveness of the proposed model, the authors conduct experiments on the large-scale WMT14 En⇒De and WMT17 Zh⇒En datasets.
  • Linguistic analyses on probing tasks further show that the model generates more informative representations, especially for syntactic structure features.
Summary
  • Introduction: Transformer (Vaswani et al., 2017), a network architecture based solely on attention mechanisms, has advanced the state of the art on various translation tasks, but prior studies empirically show that its lack of recurrence modeling hinders further improvement of translation quality (Dehghani et al., 2019). Modeling recurrence is crucial for capturing essential properties of the input sequence, such as structural representations (Tran et al., 2016) and positional encoding (Shaw et al., 2018), which are exactly the weaknesses of SAN (Tran et al., 2018).
  • Objectives: The aim of this paper is not to explore this whole space but to show that some fairly straightforward implementations work well. The authors' approach is complementary to prior work on the SAN encoder, since that work focuses on improving the representation power of the SAN encoder, while the authors aim to complement the SAN encoder with an additional recurrence encoder.
  • Results: The authors evaluated the proposed approach on the widely used WMT17 Zh⇒En and WMT14 En⇒De data, as listed in Table 3, comparing against prior reported systems and strong systems of their own.
  • Conclusion: The authors propose to directly model recurrence for Transformer with an additional recurrence encoder that generates recurrence representations for the input sequence. They study two strategies to integrate these representations into the Transformer decoder, and linguistic analyses on probing tasks show that the model generates more informative representations, especially for syntactic structure features.
Tables
  • Table 1: Evaluation of recurrence encoder implementations. The output of the recurrence encoder is fed to the top decoder layer in a stack fusion. "Speed" denotes the training speed (steps/second).
  • Table 2: Evaluation of decoder integration strategies.
  • Table 3: Comparison with the existing NMT systems on the WMT17 Zh⇒En and WMT14 En⇒De test sets. "↑ / ⇑"
  • Table 4: Comparison with re-implemented related work: "RelPos": relative position encoding (Shaw et al., 2018); "DiSAN": directional self-attention network (Shen et al., 2018).
  • Table 5: Classification accuracies on 10 probing tasks evaluating the linguistic properties embedded in the encoder outputs (a generic sketch of such a probing setup follows this list).
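As a rough illustration of how the probing accuracies in Table 5 are typically obtained, the sketch below freezes the encoder outputs, pools them into sentence vectors, and trains a simple classifier to predict a linguistic label. The mean-pooling, the logistic-regression probe, and the placeholder data are generic assumptions rather than the authors' exact probing setup.

```python
# Generic probing-task sketch: train a lightweight classifier on frozen encoder
# outputs to predict a linguistic property, then report test accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def probe_accuracy(train_repr, train_labels, test_repr, test_labels):
    """repr arrays: (n_sentences, seq_len, d_model) frozen encoder outputs."""
    x_train = train_repr.mean(axis=1)  # mean-pool tokens into sentence vectors
    x_test = test_repr.mean(axis=1)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(x_train, train_labels)
    return accuracy_score(test_labels, probe.predict(x_test))


# Random placeholder data; swap in real encoder outputs and probing labels.
rng = np.random.default_rng(0)
acc = probe_accuracy(
    rng.normal(size=(200, 30, 512)), rng.integers(0, 2, size=200),
    rng.normal(size=(50, 30, 512)), rng.integers(0, 2, size=50),
)
print(f"probing accuracy: {acc:.3f}")
```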
Related Work
  • Improving the Transformer encoder: from the perspective of representation learning, there has been an increasing amount of work on improving the representation power of the SAN encoder. Bawden et al. (2018) and Voita et al. (2018) exploit external context for the SAN encoder, while Yang et al. (2019) leverage the intermediate representations to contextualize the transformations in SAN. A number of recent efforts have explored ways to improve multi-head SAN by encouraging individual attention heads to extract distinct information (Strubell et al., 2018; Li et al., 2018). Concerning multi-layer SAN encoders, Dou et al. (2018, 2019) and Wang et al. (2018) propose to aggregate the multi-layer representations, and Dehghani et al. (2019) recurrently refine these representations. Our approach is complementary to theirs, since they focus on improving the representation power of the SAN encoder, while we aim to complement the SAN encoder with an additional recurrence encoder.
  • Along the direction of modeling recurrence for Transformer, …
Funding
  • J.Z. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM126558. We thank the anonymous reviewers for their insightful comments.
Study Subjects and Analysis
where Rec(·) is the function of recurrence modeling. Note that at the bottom layer of the recurrence encoder (N = 1), we do not employ a residual connection on the recurrence sub-layer (i.e., Equation 7), which releases the constraint that C_r^1 should share the same length as the input embedding sequence E_in (the corresponding layer computation is written out below).
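For readability, the layer computation this excerpt refers to can be written out as follows. This is a reconstruction from the surrounding description rather than a verbatim copy of the paper's Equation 7: Rec(·) is the recurrence function (a standard RNN or the proposed ARN), LN(·) denotes layer normalization (Ba et al., 2016), and the exact symbols are assumed.

```latex
% Recurrence-encoder layer computation, reconstructed from the excerpt above
% (assumed notation). The residual term is used for layers N > 1 but dropped at
% the bottom layer (N = 1), so C_r^1 need not have the same length as the input
% embedding sequence E_in.
\begin{align}
  \mathbf{C}_r^{1} &= \mathrm{LN}\!\left(\mathrm{Rec}(\mathbf{E}_{\mathrm{in}})\right),\\
  \mathbf{C}_r^{N} &= \mathrm{LN}\!\left(\mathrm{Rec}(\mathbf{C}_r^{N-1}) + \mathbf{C}_r^{N-1}\right), \qquad N > 1.
\end{align}
```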

References
  • Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In NAACL.
  • Mia Xu Chen, Orhan Firat, Ankur Bapna, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.
  • Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In ICLR.
  • Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In ACL.
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In ICLR.
  • Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL.
  • Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-head attention with disagreement regularization. In EMNLP.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
  • Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
  • Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL.
  • Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI.
  • Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In ICLR.
  • Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In EMNLP.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In ACL.
  • Ke Tran, Arianna Bisazza, and Christof Monz. 2016. Recurrent memory networks for language modeling. In NAACL.
  • Qiang Wang, Fuxue Li, Tong Xiao, Yanyang Li, Yinqiao Li, and Jingbo Zhu. 2018. Multi-layer representation fusion for neural machine translation. In COLING.
  • Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In EMNLP.
  • Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017. Context gates for neural machine translation. TACL.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
  • Baosong Yang, Jian Li, Derek F. Wong, Lidia S. Chao, Xing Wang, and Zhaopeng Tu. 2019. Context-aware self-attention networks. In AAAI.
  • Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, and William W. Cohen. 2016. Review networks for caption generation. In NIPS.