# Parallel Machine Translation with Disentangled Context Transformer

International Conference on Machine Learning (ICML), 2020.

Abstract:

State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel. We propose an atten…

Introduction

- State-of-the-art neural machine translation systems use autoregressive decoding where words are predicted one-by-one conditioned on all previous words (Bahdanau et al., 2015; Vaswani et al., 2017).
- One way to remedy this fundamental problem is to refine model output iteratively (Lee et al., 2018; Ghazvininejad et al., 2019).
- This work pursues this iterative approach to non-autoregressive translation.
- Unlike masked language models (Devlin et al., 2019; Ghazvininejad et al., 2019), where the model only predicts the masked words, the DisCo transformer can predict all words simultaneously, leading to faster inference as well as a substantial performance gain when training data are relatively large.
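The disentangled-context idea can be pictured as a per-token context mask: every position is predicted from some subset of the *other* positions, so no token is wasted as pure input. The sketch below is illustrative only; the sampling scheme (a per-token keep probability) is an assumption, not the paper's exact procedure.

```python
import random

def disco_context_mask(length: int, rng: random.Random) -> list[list[bool]]:
    """Sample, for each position i, a random subset of the OTHER positions
    as its observed context. mask[i][j] == True means position i may
    attend to the token at position j. Illustrative sketch only."""
    mask = []
    for i in range(length):
        keep_prob = rng.random()  # each token draws its own context size
        # Position i never observes itself: it must be predicted from context.
        row = [j != i and rng.random() < keep_prob for j in range(length)]
        mask.append(row)
    return mask

mask = disco_context_mask(4, random.Random(0))
```

Because every row gets its own context, a single forward pass yields a training signal at every position, unlike masked language modeling where only the masked slots are supervised.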

Highlights

- State-of-the-art neural machine translation systems use autoregressive decoding where words are predicted one-by-one conditioned on all previous words (Bahdanau et al., 2015; Vaswani et al., 2017)
- We introduce a new inference algorithm for iterative parallel decoding, parallel easy-first, where each word is predicted by attending to the words that the model is more confident about
- We propose a Disentangled Context objective as an efficient alternative to masked language modeling and design an architecture that can compute the objective in a single pass
- We demonstrate that our Disentangled Context transformer with the parallel easy-first inference achieves comparable performance to, if not better than, prior work on non-autoregressive machine translation with substantial reduction in the number of sequential steps of transformer computation
- Our Disentangled Context transformer with the parallel easy-first inference achieves at least comparable performance to the conditional masked language model with 10 steps despite the significantly fewer steps on average (e.g. 4.82 steps in en→de)
- We presented the Disentangled Context transformer that predicts every word in a sentence conditioned on an arbitrary subset of the other words
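The parallel easy-first procedure highlighted above can be sketched as follows. The `predict` callable is a hypothetical stand-in for the DisCo transformer (it maps each position's set of observed context positions to a token and a confidence per position), and the convergence check is one plausible stopping rule; neither is the paper's exact interface.

```python
def parallel_easy_first(predict, length: int, max_iters: int = 10):
    """Sketch of parallel easy-first decoding: each round, every position
    re-predicts in parallel while attending only to positions the model
    is currently MORE confident about; stop when the output stabilizes."""
    # First pass: every position predicts with no observed context.
    contexts = [set() for _ in range(length)]
    tokens, confs = predict(contexts)
    for _ in range(max_iters - 1):
        # Position i attends to the positions with higher confidence.
        contexts = [
            {j for j in range(length) if j != i and confs[j] > confs[i]}
            for i in range(length)
        ]
        new_tokens, confs = predict(contexts)
        if new_tokens == tokens:  # converged: all predictions stable
            break
        tokens = new_tokens
    return tokens
```

Because "easy" (high-confidence) positions anchor the "hard" ones, the loop often converges in far fewer than `max_iters` rounds, which is where the reduction in sequential transformer passes comes from.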

Methods

- The authors conduct extensive experiments on standard machine translation benchmarks.
- The authors demonstrate that the DisCo transformer with the parallel easy-first inference achieves comparable performance to, if not better than, prior work on non-autoregressive machine translation with substantial reduction in the number of sequential steps of transformer computation.
- The authors find that the DisCo transformer achieves more pronounced improvement when bitext training data are large, getting close to the performance of autoregressive models.

Results

**Results and Discussion**

Seen in Table 1 are the results in the four directions from the WMT’14 EN-DE and WMT’16 EN-RO datasets.

- The authors' re-implementations of CMLM + Mask-Predict outperform Ghazvininejad et al. (2019) (e.g. 31.24 vs 30.53 in de→en with 10 steps).
- This is probably due to tuning the dropout rate and averaging the weights of the 5 best epochs based on validation BLEU performance (Sec. 4.1).
- The authors' DisCo transformer with the parallel easy-first inference achieves at least comparable performance to the CMLM with 10 steps despite significantly fewer steps on average (e.g. 4.82 steps in en→de).
- Each iteration of inference with LevT involves three sequential transformer runs, which undermines the latency improvement.
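A quick back-of-the-envelope check on the step counts quoted above, treating each sequential transformer pass as equally expensive (a simplification, since per-pass cost also depends on batch and length):

```python
# Step counts quoted in the summary above.
cmlm_steps = 10     # CMLM + Mask-Predict with T = 10 iterations
disco_steps = 4.82  # DisCo + parallel easy-first, en->de average

# Relative reduction in sequential transformer passes,
# assuming every pass costs the same.
reduction = cmlm_steps / disco_steps
print(f"{reduction:.2f}x fewer sequential passes")  # about 2.07x
```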

Conclusion

- The authors presented the DisCo transformer that predicts every word in a sentence conditioned on an arbitrary subset of the other words.
- The authors developed an inference algorithm that takes advantage of this efficiency and further speeds up generation without loss in translation quality.
- The authors' results provide further support for the claim that non-autoregressive translation is a fast, viable alternative to autoregressive translation.
- A discrepancy still remains between autoregressive and non-autoregressive performance when knowledge distillation from a large transformer is applied to both.
- The authors will explore ways to narrow this gap in the future.

Tables

- Table1: The performance of non-autoregressive machine translation methods on the WMT’14 EN-DE and WMT’16 EN-RO test data. The Step columns indicate the average number of sequential transformer passes. Shaded results use a small transformer (dmodel = dhidden = 512). Our EN-DE results in parentheses show the scores after conventional compound splitting (Luong et al., 2015; Vaswani et al., 2017). Unfortunately, we lack consensus in evaluation (Post, 2018)
- Table2: WMT’17 EN-ZH test results
- Table3: WMT’14 EN-FR test results
- Table4: Effects of distillation across different models and inference. All results are from the corresponding dev data. T and b denote the max number of iterations and beam size respectively
- Table5: Test results from AT with contextless keys and values
- Table6: Dev results from bringing training closer to inference
- Table7: Dev results with different decoding strategies

References

- Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L. S., and Auli, M. Cloze-driven pretraining of self-attention networks, 2019. URL https://arxiv.org/abs/1903.07785.
- Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR, 2015. URL https://arxiv.org/pdf/1409.0473.pdf.
- Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, 2019. URL https://arxiv.org/abs/1810.04805.
- Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. Convolutional sequence to sequence learning. In Proc. of ICML, 2017. URL https://arxiv.org/abs/1705.03122.
- Ghazvininejad, M., Levy, O., Liu, Y., and Zettlemoyer, L. S. Mask-predict: Parallel decoding of conditional masked language models. In Proc. of EMNLP, 2019. URL https://arxiv.org/abs/1904.09324.
- Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. Non-autoregressive neural machine translation. In Proc. of ICLR, 2018. URL https://arxiv.org/abs/1711.02281.
- Gu, J., Wang, C., and Zhao, J. Levenshtein transformer. In Proc. of NeurIPS, 2019. URL https://arxiv.org/abs/1905.11006.
- Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.-Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., and Zhou, M. Achieving human parity on automatic Chinese to English news translation, 2018. URL https://arxiv.org/abs/1803.05567.
- Kaiser, L., Roy, A., Vaswani, A., Parmar, N., Bengio, S., Uszkoreit, J., and Shazeer, N. Fast decoding in sequence models using discrete latent variables. In Proc. of ICML, 2018. URL https://arxiv.org/abs/1803.03382.
- Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Proc. of EMNLP, 2016. URL https://arxiv.org/abs/1606.07947.
- Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015. URL https://arxiv.org/abs/1412.6980.
- Lee, J. D., Mansimov, E., and Cho, K. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. of EMNLP, 2018. URL https://arxiv.org/abs/1802.06901.
- Li, Z., Lin, Z., He, D., Tian, F., Qin, T., Wang, L., and Liu, T.-Y. Hint-based training for non-autoregressive machine translation. In Proc. of EMNLP, 2019. URL https://arxiv.org/abs/1909.06708.
- Libovicky, J. and Helcl, J. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proc. of EMNLP, 2018. URL https://arxiv.org/abs/1811.04719.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. S., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692.
- Luong, T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, September 2015. URL https://www.aclweb.org/anthology/D15-1166.
- Ma, X., Zhou, C., Li, X., Neubig, G., and Hovy, E. H. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, 2019. URL https://arxiv.org/abs/1909.02480.
- Mansimov, E., Wang, A., and Cho, K. A generalized framework of sequence generation with application to undirected sequence models, 2019. URL https://arxiv.org/abs/1905.12790.
- Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. In Proc. of ICLR, 2018. URL https://arxiv.org/abs/1710.03740.
- Nakayama, S., Kano, T., Tjandra, A., Sakti, S., and Nakamura, S. Recognition and translation of code-switching speech utterances. In Proc. of Oriental COCOSDA, 2019. URL https://ahcweb01.naist.jp/papers/conference/2019/201910_OCOCOSDA_sahoko-n/201910_OCOCOSDA_sahoko-n.paper.pdf.
- Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Proc. of WMT, 2018. URL https://arxiv.org/abs/1806.00187.
- Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002. URL https://www.aclweb.org/anthology/P02-1040.pdf.
- Post, M. A call for clarity in reporting BLEU scores. In Proc. of WMT, 2018. URL https://www.aclweb.org/anthology/W18-6319.
- Ran, Q., Lin, Y., Li, P., and Zhou, J. Guiding nonautoregressive neural machine translation decoding with reordering information, 2019. URL https://arxiv.org/abs/1911.02215.
- Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proc. of ACL, 2016. URL https://www.aclweb.org/anthology/P16-1162.
- Shao, C., Zhang, J., Feng, Y., Meng, F., and Zhou, J. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In Proc. of AAAI, 2020. URL https://arxiv.org/abs/1911.09320.
- Shu, R., Lee, J., Nakayama, H., and Cho, K. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proc. of AAAI, 2020. URL https://arxiv.org/abs/1908.07181.
- Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. In Proc. of NeurIPS, 2018. URL https://arxiv.org/abs/1811.03115.
- Stern, M., Chan, W., Kiros, J. R., and Uszkoreit, J. Insertion transformer: Flexible sequence generation via insertion operations. In Proc. of ICML, 2019. URL https://arxiv.org/abs/1902.03249.
- Sun, Z., Li, Z., Wang, H., He, D., Lin, Z., and Deng, Z. Fast structured decoding for sequence models. In Proc. of NeurIPS, 2019. URL https://arxiv.org/abs/1910.11555.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proc. of NeurIPS, 2017. URL https://arxiv.org/pdf/1706.03762.pdf.
- Wang, Y., Tian, F., He, D., Qin, T., Zhai, C., and Liu, T.-Y. Non-autoregressive machine translation with auxiliary regularization. In Proc. of AAAI, 2019. URL https://arxiv.org/abs/1902.10245.
- Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. Pay less attention with lightweight and dynamic convolutions. In Proc. of ICLR, 2019. URL https://arxiv.org/abs/1901.10430.
- Yang, B., Liu, F., and Zou, Y. Non-autoregressive video captioning with iterative refinement, 2019a. URL https://arxiv.org/abs/1911.12018.
- Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In Proc. of NeurIPS, 2019b. URL https://arxiv.org/pdf/1906.08237.pdf.
- Zhou, C., Neubig, G., and Gu, J. Understanding knowledge distillation in non-autoregressive machine translation, 2020. URL https://arxiv.org/abs/1911.02727.
