Sequence Generation with Mixed Representations

ICML, pp. 10388-10398, 2020.


Abstract:

Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role for neural NLP models. Tokenization methods such as byte-pair encoding and SentencePiece, which can greatly reduce the large vocabulary size and deal with out-of-vocabulary words, have been shown to be effective and are widely adopted for...

Introduction
Highlights
  • Natural language processing (NLP) has achieved great success with deep neural networks in recent years (Deng & Liu, 2018; Zhang et al, 2015; Deng et al, 2013; Wu et al, 2016; Hassan et al, 2018)
  • Byte-pair encoding (BPE) (Sennrich et al, 2015) constructs the vocabulary based on subword frequency; word-level tokenization must be applied before BPE is used (see the BPE sketch after this list)
  • Similar improvements are observed when using the SP tokenizer. These results demonstrate that our method works across different sequence generation tasks
  • Our model architecture achieves better performance over the baselines with both the BPE and WP tokenizers, which again demonstrates that leveraging multiple tokenizers to build mixed representations helps improve translation quality
  • We propose to generate sequences with mixed representations by leveraging different subword tokenization methods
  • The effectiveness of our approach is verified on machine translation and abstractive summarization tasks
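To make the BPE highlight concrete, the following is a minimal, self-contained sketch of the frequency-based merge procedure described by Sennrich et al. (2015). It illustrates the general idea only; the toy word counts are invented for illustration and this is not the authors' implementation.

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words are pre-split into characters (word-level tokenization comes first),
# with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                      # number of merges = size of the learned subword vocabulary
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                          # each merge becomes one subword unit, e.g. ('e', 's'), ('es', 't')
```

Applying the learned merges, in order, to a new word yields @@-style segmentations like those shown in Table 1.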
Methods
  • To evaluate the proposed model and training algorithm, the authors conduct experiments on two standard sequence generation tasks: machine translation and abstractive summarization.
  • Data: The authors conduct experiments on standard translation tasks with multiple language pairs: English↔German (En↔De for short), English↔Dutch (En↔Nl), English↔Polish (En↔Pl), English↔Portuguese-Brazil (En↔Pt-br), English↔Turkish (En↔Tr), and English↔Romanian (En↔Ro)
  • These benchmark datasets all come from the widely acknowledged IWSLT-2014 machine translation (Cettolo et al, 2014) competition.
  • The resulting datasets contain about 160k/7k/7k sentence pairs for the training/validation/test sets of the En↔De task, 180k/4.7k/1.1k for En↔Ro, 170k/4.5k/1.1k for En↔Nl, 175k/4.5k/1.2k for En↔Pt-br, 181k/4.7k/1.2k for En↔Pl, and 160k/4.5k/1k for En↔Tr, respectively
Results
  • Tokenization example from Table 1 (Ori stands for the original German sentence; the detokenization conventions are sketched after this list):

    Ori: und diese einfachen themen sind eigentlich keine komplexen wissenschaftlichen zusammenhange

    BPE: und diese einfachen themen sind eigentlich keine komplex@@ en wissenschaftlichen zusammen@@ han@@ ge

    WP: und diese einfachen themen sind eigentlich keine komplexe n wissenschaftlichen zusammen hange

    SP: und diese einfachen them en sind eigentlich keine komplexen wissenschaft lichen zusammenhange
  • Compared with Transformer 512, both the BPE-decoded and SP-decoded results outperform the corresponding baselines by about 0.5 BLEU points on most tasks.
  • Similar improvements are observed when using the SP tokenizer
  • These results demonstrate that the method generally works across different sequence generation tasks.
  • As shown in Table 8, the model architecture achieves better performance with BPE and WP tokenizers over the baselines, which again demonstrates that leveraging multiple tokenizers to build mixed representations helps improve translation quality
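Since the segmentations listed above are compared after BPE, WP, and SP decoding, here is a small sketch of the standard detokenization conventions for turning each tokenizer's subword sequence back into plain text. The marker conventions (@@ for BPE, ## for WordPiece continuations, ▁ for SentencePiece) are common library defaults assumed here, not details stated in the paper (some markers were lost in the extracted example above).

```python
def detok_bpe(tokens):
    """subword-nmt style BPE: '@@' marks a subword that continues into the next token."""
    return " ".join(tokens).replace("@@ ", "")

def detok_wordpiece(tokens):
    """BERT-style WordPiece: '##' marks a continuation of the previous token (assumed convention)."""
    out = []
    for tok in tokens:
        if tok.startswith("##") and out:
            out[-1] += tok[2:]
        else:
            out.append(tok)
    return " ".join(out)

def detok_sentencepiece(pieces):
    """SentencePiece: '▁' (U+2581) marks the start of a new word."""
    return "".join(pieces).replace("\u2581", " ").strip()

print(detok_bpe("keine komplex@@ en zusammen@@ han@@ ge".split()))
# -> "keine komplexen zusammenhange"
print(detok_sentencepiece(["\u2581keine", "\u2581komplexen", "\u2581wissenschaft", "lichen"]))
# -> "keine komplexen wissenschaftlichen"
```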
Conclusion
  • The authors propose to generate sequences with mixed representations by leveraging different subword tokenization methods.
  • The authors introduce a new model structure to incorporate mixed representations from different tokenization methods, and a co-teaching algorithm to better utilize the diversity and advantages of each individual tokenization method (a simplified multi-tokenization training sketch follows this list)
  • The authors will apply the algorithm to more sequence learning applications, such as text classification and natural language understanding
  • The authors will also study how to extend the model with more subword tokenization methods
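The co-teaching algorithm itself is not detailed in this summary. As a rough point of reference, the sketch below shows only the simpler multi-task flavor of combining tokenizations, where one shared model is trained on both the BPE and the SP segmentation of the same target sentence. The joint vocabulary, toy sizes, random tensors, and the use of PyTorch are all illustrative assumptions; the authors' co-teaching objective additionally lets each tokenization guide the other, which this sketch does not capture.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a shared decoder scores tokens from a joint vocabulary that
# contains both BPE and SP subwords; random tensors stand in for real model outputs.
joint_vocab = 12000
batch, len_bpe, len_sp = 8, 30, 26          # the two segmentations generally differ in length

logits_bpe = torch.randn(batch, len_bpe, joint_vocab, requires_grad=True)
logits_sp = torch.randn(batch, len_sp, joint_vocab, requires_grad=True)
target_bpe = torch.randint(0, joint_vocab, (batch, len_bpe))   # BPE tokenization of the target
target_sp = torch.randint(0, joint_vocab, (batch, len_sp))     # SP tokenization of the same target

# Cross-entropy expects (batch, classes, length), hence the transpose.
loss_bpe = F.cross_entropy(logits_bpe.transpose(1, 2), target_bpe)
loss_sp = F.cross_entropy(logits_sp.transpose(1, 2), target_sp)

# Plain multi-task combination of the two tokenization views; co-teaching as described
# in the paper goes further by letting each view act as a teacher for the other.
loss = 0.5 * (loss_bpe + loss_sp)
loss.backward()
print(float(loss))
```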
Summary
  • Introduction:

    Natural language processing (NLP) has achieved great success with deep neural networks in recent years (Deng & Liu, 2018; Zhang et al, 2015; Deng et al, 2013; Wu et al, 2016; Hassan et al, 2018).
  • For neural NLP models, tokenization, which chops a raw sequence up into pieces, is the first step and plays a key role in text preprocessing.
  • Tokenization is traditionally performed at either the word level (Arppe et al, 2005; Bahdanau et al, 2014), which splits a raw sentence by spaces and applies language-specific rules to punctuation marks, or the character level (Kim et al, 2016; Lee et al, 2017), which directly segments words into individual characters.
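As a trivial illustration of these two extremes (not tied to any particular toolkit):

```python
sentence = "und diese einfachen themen"

# Word-level: split on whitespace (real tokenizers also handle punctuation with language-specific rules).
word_tokens = sentence.split()            # ['und', 'diese', 'einfachen', 'themen']

# Character-level: every character becomes a token (spaces shown as '_' for readability).
char_tokens = [c if c != " " else "_" for c in sentence]

print(word_tokens)
print(char_tokens[:12])                   # ['u', 'n', 'd', '_', 'd', 'i', 'e', 's', 'e', '_', 'e', 'i']
```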
Tables
  • Table1: A German sentence example (Ori stands for original) of tokenization results by the BPE, WordPiece (WP), and SentencePiece (SP) tokenizers. Different subwords are highlighted in bold font; markers such as @@ indicate the boundaries of subwords
  • Table2: Performance of the BPE, WordPiece (WP), and SentencePiece (SP) tokenizers on different IWSLT translation tasks. Details of the experimental settings are given in Section 4.1
  • Table3: Machine translation results of our model and the standard Transformer on various IWSLT-2014 translation datasets. “Transformer 512” and “Transformer 256” refer to the baseline models with embedding dimension 512 and 256. Our model is equipped with embedding dimension 256. The numbers in bold font stand for results that are significantly better than the Transformer 512 results with p-value less than 0.05 (Koehn, 2004); a sketch of this significance test follows this list
  • Table4: Comparison with existing works on IWSLT De→En translation tasks
  • Table5: Results of abstractive summarization
  • Table6: Results of stacked representations on En↔De translations
  • Table7: Results of co-teaching with only one type of tokenization on En↔De translations
  • Table8: Results of our model with BPE and WP tokenizers on En↔De translations
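Table 3 marks results that are significantly better than Transformer 512 using the significance test of Koehn (2004), i.e., paired bootstrap resampling. Below is a minimal sketch of that procedure; the use of the sacrebleu package and the 1000-sample setting are assumptions for illustration, not details taken from the paper.

```python
import random
import sacrebleu  # assumed tooling; any corpus-level BLEU implementation works

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Koehn (2004)-style test: how often does system A beat system B on resampled test sets?"""
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]          # resample sentences with replacement
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        wins_a += bleu_a > bleu_b
    return wins_a / n_samples                            # A is significant at p < 0.05 if this exceeds 0.95

# Usage with hypothetical output/reference lists of equal length:
# p = paired_bootstrap(mixed_outputs, baseline_outputs, references)
```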
Related work
  • In this section, we first introduce the background of several tokenization approaches, and then review recent works that leverage different tokenizers.

    2.1. Tokenization Approaches

We describe the details of the different tokenization approaches here. The BPE tokenizer (Sennrich et al, 2015) initializes the vocabulary with all characters and builds the final vocabulary by iteratively merging frequent character n-grams. Similarly, WordPiece (Schuster & Nakajima, 2012) also constructs its vocabulary starting from characters; different from BPE, it forms a new subword according to the likelihood on the training data instead of the next highest-frequency pair. SentencePiece (Kudo, 2018; Kudo & Richardson, 2018) is based on the assumption that all subword occurrences are independent and that a tokenized sequence is produced by the product of the subword occurrence probabilities; it therefore selects subwords and builds the subword dictionary based on word occurrences and the loss of each subword. Both WordPiece and SentencePiece leverage language models to build their vocabularies.
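For readers who want to reproduce segmentations like those in Table 1, the sketch below trains a BPE-style and a unigram (SentencePiece-default) model on the same corpus with the sentencepiece library and compares their segmentations. The corpus file name and vocabulary size are illustrative assumptions, and a recent sentencepiece release is assumed; a WordPiece model could be trained analogously with, e.g., the Hugging Face tokenizers library.

```python
import sentencepiece as spm

# Train two subword models on the same plain-text corpus (one sentence per line).
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.de.txt",            # hypothetical corpus file
        model_prefix=f"de_{model_type}",
        vocab_size=10000,
        model_type=model_type,            # "unigram" is the SentencePiece default
    )

sent = "und diese einfachen themen sind eigentlich keine komplexen wissenschaftlichen zusammenhange"
for model_type in ("bpe", "unigram"):
    sp = spm.SentencePieceProcessor(model_file=f"de_{model_type}.model")
    print(model_type, sp.encode(sent, out_type=str))   # pieces such as '▁wissenschaft', 'lichen'
```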
Reference
  • Arppe, A., Carlson, L., Linden, K., Piitulainen, J. O., Suominen, M., Vainio, M., Westerlund, H., and Yli-Jyra, A. M. Inquiries into words, constraints and contexts: Festschrift in the honour of Kimmo Koskenniemi on his 60th birthday. CSLI Publications, 2005.
  • Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actorcritic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
  • Cettolo, M., Niehues, J., Stuker, S., Bentivogli, L., and Federico, M. Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, volume 57, 2014.
  • Cherry, C., Foster, G., Bapna, A., Firat, O., and Macherey, W. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4295–4305, 2018.
  • Deng, L. and Liu, Y. Deep learning in natural language processing. Springer, 2018.
Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al. Recent advances in deep learning for speech research at microsoft. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8604–8608. IEEE, 2013.
  • Fonollosa, J. A., Casas, N., and Costa-jussa, M. R. Joint source-target self attention with locality constraints. arXiv preprint arXiv:1905.06596, 2019.
  • Graff, D. and Cieri, C. English gigaword, linguistic data consortium, 2003.
  • Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., et al. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567, 2018.
  • Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. Characteraware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koehn, P. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 388–395, 2004.
  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177–180, 2007.
  • Kudo, T. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959, 2018.
  • Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
  • Larsson, G., Maire, M., and Shakhnarovich, G. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
  • Lee, J., Cho, K., and Hofmann, T. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378, 2017.
  • Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
  • Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389, 2015.
  • Ling, W., Trancoso, I., Dyer, C., and Black, A. W. Character-based neural machine translation. arXiv preprint arXiv:1511.04586, 2015.
  • Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762, 2019.
  • Luong, M.-T. and Manning, C. D. Achieving open vocabulary neural machine translation with hybrid wordcharacter models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1054–1063, 2016.
  • Medress, M. F., Cooper, F. S., Forgie, J. W., Green, C., Klatt, D. H., O’Malley, M. H., Neuburg, E. P., Newell, A., Reddy, D., Ritea, B., et al. Speech understanding systems: Report of a steering committee. Artificial Intelligence, 9 (3):307–316, 1977.
  • Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53, 2019.
  • Pan, Y., Li, X., Yang, Y., and Dong, R. Morphological word segmentation on agglutinative languages for neural machine translation. arXiv preprint arXiv:2001.01589, 2020.
  • Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
  • Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
  • Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
  • Schuster, M. and Nakajima, K. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149– 5152. IEEE, 2012.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Shen, S., Zhao, Y., Liu, Z., Sun, M., et al. Neural headline generation with sentence-wise optimization. arXiv preprint arXiv:1604.01904, 2016.
  • Srinivasan, T., Sanabria, R., and Metze, F. Multitask learning for different subword segmentations in neural machine translation. arXiv preprint arXiv:1910.12368, 2019.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Wang, D., Gong, C., and Liu, Q. Improving neural language modeling via adversarial training. In International Conference on Machine Learning, pp. 6555–6565, 2019a.
  • Wang, X., Pham, H., Arthur, P., and Neubig, G. Multilingual neural machine translation with soft decoupled encoding. arXiv preprint arXiv:1902.03499, 2019b.
  • Wang, Y., Xia, Y., He, T., Tian, F., Qin, T., Zhai, C. X., and Liu, T. Y. Multi-agent dual learning. In 7th International Conference on Learning Representations, ICLR 2019, 2019c.
  • Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.
  • Provilkov, I., Emelianenko, D., and Voita, E. Bpe-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019.
  • Wu, L., Zhao, L., Qin, T., Lai, J., and Liu, T.-Y. Sequence prediction with unlabeled data by reward function learning. In IJCAI, 2017.
  • Wu, L., Tian, F., Zhao, L., Lai, J., and Liu, T.-Y. Word attention for sequence to sequence text understanding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657, 2015.
  • Zhu, J., Gao, F., Wu, L., Xia, Y., Qin, T., Zhou, W., Cheng, X., and Liu, T.-Y. Soft contextual data augmentation for neural machine translation. arXiv preprint arXiv:1905.10523, 2019.
  • Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., and Liu, T. Incorporating bert into neural machine translation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyl7ygStwB.