Monotonic Multihead Attention

James Cross
Liezl Puzon

ICLR, 2020.


Abstract:

Simultaneous machine translation models start generating a target sequence before they have encoded or read the entire source sequence. Recent approaches to this task either apply a fixed policy on top of the Transformer, or use learnable monotonic attention on a weaker recurrent neural network based architecture. In this paper, we propose a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention.

Introduction
  • Simultaneous machine translation adds the capability of a live interpreter to machine translation: a simultaneous model starts generating a translation before it has finished reading the entire source sentence.
  • Monotonic infinite lookback attention (MILk) outperforms hard monotonic attention and monotonic chunkwise attention (MoChA); while the other two monotonic attention mechanisms only consider a fixed window, MILk computes a softmax attention over all previous encoder states, which may be the key to its improved latency-quality tradeoffs.
  • These monotonic attention approaches provide a closed-form expression for the expected alignment between source and target tokens, sketched below.
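As a reference point, here is a minimal sketch of that closed form, following Raffel et al. (2017) and Arivazhagan et al. (2019); the symbols (p_{i,j} for the stepwise selection probability, alpha_{i,j} for the expected alignment, u_{i,j} for the soft-attention energy, s_i and h_j for decoder and encoder states) are assumed from those papers:

    p_{i,j} = \sigma\big(\mathrm{Energy}(s_{i-1}, h_j)\big)
    \alpha_{i,j} = p_{i,j}\left((1 - p_{i,j-1})\,\frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j}\right)
    \beta_{i,j} = \sum_{k=j}^{|x|} \frac{\alpha_{i,k}\,\exp(u_{i,j})}{\sum_{l=1}^{k}\exp(u_{i,l})}

Here \alpha_{i,j} is the expected monotonic alignment shared by these approaches, and \beta_{i,j} is MILk's infinite-lookback soft attention, computed over all encoder states up to the selected position.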
Highlights
  • Simultaneous machine translation adds the capability of a live interpreter to machine translation: a simultaneous model starts generating a translation before it has finished reading the entire source sentence
  • Monotonic attention mechanisms fall into the flexible policy category, in which the policies are automatically learned from data
  • Monotonic infinite lookback attention outperforms hard monotonic attention and monotonic chunkwise attention; while the other two monotonic attention mechanisms only consider a fixed window, monotonic infinite lookback attention computes a softmax attention over all previous encoder states, which may be the key to its improved latency-quality tradeoffs
  • We propose two variants of the monotonic multihead attention model for simultaneous machine translation
  • By introducing two new targeted loss terms which allow us to control both latency and attention span (sketched after this list), we are able to leverage the power of the Transformer architecture to achieve better quality-latency trade-offs than the previous state-of-the-art model
  • We present detailed ablation studies demonstrating the efficacy and rationale of our approach
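A rough sketch of how the two targeted loss terms could be computed from the per-head expected delays; the tensor layout and names below are illustrative assumptions, not the authors' implementation (see the paper for the exact weighted-average and head-divergence formulations):

    import torch

    def latency_control_losses(expected_delays):
        # expected_delays: (num_layers, num_heads, tgt_len); entry (l, h, i) is the
        # expected number of source tokens head (l, h) has read when target token i
        # is generated (an assumed layout, for illustration only).
        L, H, T = expected_delays.shape
        g = expected_delays.reshape(L * H, T)

        # Latency term: penalize the average delay across heads, pushing the model
        # to read less source context before writing.
        g_mean = g.mean(dim=0)              # shape (T,)
        latency_loss = g_mean.mean()

        # Attention-span term: penalize how far individual heads drift from the
        # mean delay, keeping per-head read positions (and hence the attention
        # span) from growing unbounded.
        divergence_loss = ((g - g_mean) ** 2).mean()

        return latency_loss, divergence_loss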
Methods
  • The authors evaluate the model in terms of quality and latency.
  • For quality, they use tokenized BLEU for IWSLT15 En-Vi and detokenized BLEU with SacreBLEU (Post, 2018) for WMT15 De-En. For latency, they use three recent metrics: Average Proportion (AP) (Cho & Esipova, 2016), Average Lagging (AL) (Ma et al., 2019), and Differentiable Average Lagging (DAL) (Arivazhagan et al., 2019); a sketch of these metrics follows this list.
  • For reference, Arivazhagan et al. (2019) report 28.4 BLEU on WMT15 De-En.
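A minimal sketch of the three latency metrics as we read their definitions in the cited papers, where g[i] is the number of source tokens read before emitting target token i+1; this is an illustration, not the official evaluation code:

    def latency_metrics(g, src_len, tgt_len):
        gamma = tgt_len / src_len

        # Average Proportion (AP): average fraction of the source read per target token.
        ap = sum(g) / (src_len * tgt_len)

        # Average Lagging (AL): average lag behind an ideal wait-0 policy, over the
        # target steps emitted before the source is fully read.
        tau = next(i for i, gi in enumerate(g, start=1) if gi >= src_len)
        al = sum(g[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)) / tau

        # Differentiable Average Lagging (DAL): like AL, but with a minimum
        # per-token delay so every target step contributes.
        dal, g_prev = 0.0, 0.0
        for i, gi in enumerate(g, start=1):
            g_prime = gi if i == 1 else max(gi, g_prev + 1.0 / gamma)
            dal += g_prime - (i - 1) / gamma
            g_prev = g_prime
        return ap, al, dal / tgt_len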
Results
  • The authors present the main results of the model in terms of latency-quality tradeoffs, ablation studies and analyses.
  • The authors analyze the effect of the variance loss on the attention span.
  • The authors study the effect of the number of decoder layers and decoder heads on quality and latency.
  • The authors provide a case study for the behavior of attention heads in an example.
  • The authors study the relationship between the rank of an attention head and the layer it belongs to.
Conclusion
  • The authors propose two variants of the monotonic multihead attention model for simultaneous machine translation.
  • By introducing two new targeted loss terms which allow them to control both latency and attention span, the authors are able to leverage the power of the Transformer architecture to achieve better quality-latency trade-offs than the previous state-of-the-art model.
  • The authors present detailed ablation studies demonstrating the efficacy and rationale of the approach
  • By introducing these stronger simultaneous sequence-to-sequence models, the authors hope to facilitate important applications, such as high-quality real-time interpretation between human speakers
Tables
  • Table 1: Number of sentences in each split
  • Table 2: Offline model performance with unidirectional encoder and greedy decoding
  • Table 3: Effect of using a unidirectional encoder and greedy decoding on BLEU score
  • Table 4: Offline and monotonic model hyperparameters
  • Table 5: Calculation of latency metrics, given source x, target y, and delays g
  • Table 6: Detailed results for MMA-H and MMA-IL on WMT15 De-En
  • Table 7: Detailed results for MILk, MMA-H, and MMA-IL on IWSLT15 En-Vi
  • Table 8: Comparison between setting a threshold for the read action and the weighted average latency loss
  • Table 9: Detailed numbers for the average loss, weighted average loss, and head divergence loss on the WMT15 De-En development set
Related work
  • Recent work on simultaneous machine translation falls into three categories. In the first, models use a rule-based policy for reading input and writing output. Cho & Esipova (2016) propose a Wait-If-* policy to enable an offline model to decode simultaneously. Ma et al. (2019) propose a wait-k policy where the model first reads k tokens, then alternates between read and write actions. Dalvi et al. (2018) propose an incremental decoding method, also based on a rule-based schedule.
  • In the second category, a flexible policy is learned from data. Grissom II et al. (2014) introduce a Markov chain to phrase-based machine translation models for simultaneous machine translation, in which they apply reinforcement learning to learn the read-write policy based on states. Gu et al. (2017) introduce an agent which learns when to translate through interaction with a pre-trained offline neural machine translation model. Luo et al. (2017) use continuous-rewards policy gradient to learn online alignments for speech recognition. Lawson et al. (2018) propose hard alignments with variational inference for online decoding. Alinejad et al. (2018) propose a new "predict" operation which predicts future source tokens. Zheng et al. (2019b) introduce a restricted dynamic oracle and restricted imitation learning for simultaneous translation. Zheng et al. (2019a) train the agent with an action sequence generated from labels based on the rank of the gold target word given partial input.
  • Models in the last category leverage monotonic attention, replacing softmax attention with an expected attention computed from a stepwise Bernoulli selection probability. Raffel et al. (2017) first introduce the concept of monotonic attention for online linear-time decoding, where the attention only attends to one encoder state at a time. Chiu & Raffel (2018) extend that work to let the model attend to a chunk of encoder states. Arivazhagan et al. (2019) also make use of monotonic attention but introduce an infinite lookback to improve translation quality.
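As an illustration of the rule-based wait-k schedule described above (a hypothetical helper, not the authors' code):

    def wait_k_actions(k, src_len, tgt_len):
        # Read k source tokens first, then alternate WRITE/READ; once the
        # source is exhausted, write out the remaining target tokens.
        actions, read, written = [], 0, 0
        while written < tgt_len:
            if read < min(k + written, src_len):
                actions.append("READ")
                read += 1
            else:
                actions.append("WRITE")
                written += 1
        return actions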
Contributions
  • Proposes a new attention mechanism, Monotonic Multihead Attention (MMA), which extends the monotonic attention mechanism to multihead attention (see the decoding sketch after this list)
  • Introduces two novel and interpretable approaches for latency control that are designed for multiple attention heads
  • Analyzes how the latency controls affect the attention span and studies the relationship between the speed of a head and the layer it belongs to
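One illustrative reading of how multiple monotonic heads could be coordinated at inference time; the stopping rule and the all-heads-finished condition are assumptions for illustration, and stop_probability is a hypothetical helper:

    def advance_heads(heads, read_pos, encoder_states, decoder_state):
        # Each head advances its own read pointer while its stopping probability
        # stays below 0.5; a target token is written only after every head has
        # either stopped or consumed all currently available source tokens.
        for h, head in enumerate(heads):
            while read_pos[h] < len(encoder_states):
                p = head.stop_probability(decoder_state, encoder_states[read_pos[h]])
                if p >= 0.5:
                    break            # this head stops and attends from here
                read_pos[h] += 1     # otherwise keep reading the source
        return read_pos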
References
  • Ashkan Alinejad, Maryam Siahbani, and Anoop Sarkar. Prediction improves simultaneous neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3022–3027, 2018.
  • Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1313–1323, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1126.
  • Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5301. URL https://www.aclweb.org/anthology/W19-5301.
  • Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Roldano Cattoni, and Marcello Federico. The IWSLT 2016 evaluation campaign. In International Workshop on Spoken Language Translation, 2016.
  • Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, et al. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–86, 2018.
  • Chung-Cheng Chiu and Colin Raffel. Monotonic chunkwise attention. In International Conference on Learning Representations, 2018. URL https://openreview.net/pdf?id=Hko85plCW.
  • Kyunghyun Cho and Masha Esipova. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012, 2016.
  • Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 493–499, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2079. URL https://www.aclweb.org/anthology/N18-2079.
  • Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1342–1352, 2014.
  • Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. Learning to translate in real-time with neural machine translation. In 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, pp. 1053–1062. Association for Computational Linguistics (ACL), 2017.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P07-2045.
  • Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, and Navdeep Jaitly. Learning hard alignments with variational inference. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5799–5803. IEEE, 2018.
  • Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, and Ilya Sutskever. Learning online alignments with continuous rewards policy gradient. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2801–2805. IEEE, 2017.
  • Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. 2015.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3025–3036, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1289.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.
  • Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and linear-time attention by enforcing monotonic alignments. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2837–2846. JMLR.org, 2017.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1349–1354, Hong Kong, China, November 2019a. Association for Computational Linguistics. doi: 10.18653/v1/D19-1137. URL https://www.aclweb.org/anthology/D19-1137.
  • Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5816–5822, Florence, Italy, July 2019b. Association for Computational Linguistics. doi: 10.18653/v1/P19-1582. URL https://www.aclweb.org/anthology/P19-1582.