Revisiting Self-Training for Neural Sequence Generation

Marc'Aurelio Ranzato

ICLR, 2020.

Abstract:

Self-training is one of the earliest and simplest semi-supervised methods. The key idea is to augment the original labeled dataset with unlabeled data paired with the model's prediction (i.e. the pseudo-parallel data). While self-training has been extensively studied on classification problems, in complex sequence generation tasks (e.g. machine translation) …

Introduction
  • Deep neural networks often require large amounts of labeled data to achieve good performance.
  • Acquiring labels is a costly process, which motivates research on methods that can effectively utilize unlabeled data to improve performance.
  • Towards this goal, semi-supervised learning (Chapelle et al, 2009) methods that take advantage of both labeled and unlabeled data are a natural starting point.
  • In the field of natural language processing, early work successfully applied self-training to word sense disambiguation (Yarowsky, 1995) and parsing (McClosky et al, 2006; Reichart & Rappoport, 2007; Huang & Harper, 2009)
Highlights
  • Deep neural networks often require large amounts of labeled data to achieve good performance
  • Since self-training was designed for classification problems, common wisdom suggests that it may be effective only when a good fraction of the predictions on unlabeled samples are correct; otherwise, mistakes will be reinforced (Zhu & Goldberg, 2009)
  • We find that the separate training strategy with the whole pseudo-parallel dataset (i.e. S = {(x, fθ(x)) | x ∈ U}) produces better or equal performance for neural sequence generation while being simpler (a sketch of this training loop follows this list)
  • To demonstrate the effect of smoothing on the fine-tuning step, we report test errors after fine-tuning
  • One natural question is whether we could further improve performance by encouraging an even lower smoothness value; however, there is a clear trade-off, as a totally smooth model that outputs a constant value is a bad predictor
  • Experiments on machine translation and text summarization demonstrate the effectiveness of this approach in both low and high resource settings
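
A minimal sketch of the separate pseudo-training + fine-tuning loop described above, assuming hypothetical helpers train(init, data), decode(model, x), and perturb(x) that stand in for the actual seq2seq training, beam-search decoding, and input-noise routines (none of which are specified in this summary):

```python
# Hypothetical helpers (not from the paper's code): train(init, data) trains a
# seq2seq model, decode(model, x) produces a pseudo target, perturb(x) injects
# input noise for the noisy variant.

def self_train(labeled, unlabeled, train, decode, perturb=None, iterations=3):
    """labeled: list of (x, y) pairs; unlabeled: list of source sequences x."""
    model = train(None, labeled)  # supervised baseline f_theta
    for _ in range(iterations):
        # Pseudo-parallel data S = {(x, f_theta(x)) | x in U}; for noisy ST the
        # pseudo target is decoded from the clean x but paired with a perturbed
        # copy of x during pseudo-training.
        pseudo = [((perturb(x) if perturb else x), decode(model, x))
                  for x in unlabeled]
        # Pseudo-training (PT) step; the paper explores both initializing from
        # scratch and from an existing model.
        model = train(model, pseudo)
        model = train(model, labeled)  # fine-tuning (FT) step
    return model
```

Pseudo-training and fine-tuning are kept as separate steps rather than training jointly on the union of real and pseudo-parallel data, matching the separate strategy described above.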
Methods
  • On WMT100K, the two self-training (ST) variants reach 16.8 (PT) / 17.9 (FT) and 16.5 (PT) / 17.5 (FT) BLEU, where PT and FT denote the pseudo-training and fine-tuning steps. Figure 1 plots BLEU on the WMT100K dataset for the supervised baseline and the different self-training variants over three iterations.
  • “Scratch” denotes that the system is initialized randomly and trained from scratch, while “baseline” means that it is initialized from the supervised baseline model.
  • Table 3 compares the baseline and the self-training variants on the toy sum dataset in terms of smoothness, symmetry, and test error.
  • To verify this hypothesis further, we work with the toy task of summing two integers in the range 0 to 99 (see the data-construction sketch after this list).
  • We perform self-training for one iteration on this toy sum dataset and initialize the model with the base model to rule out differences due to the initialization.
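
As an illustration only, here is one way the toy sum data could be constructed; the exact tokenization and dataset sizes used in the paper are not given in this summary, so the character-level formatting and the counts below are assumptions:

```python
import random

def make_sum_example(rng):
    # Source: two integers in [0, 99] spelled digit by digit, e.g. "3 7 + 5 2";
    # target: the digits of their sum, e.g. "8 9".
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    src = " ".join(list(str(a)) + ["+"] + list(str(b)))
    tgt = " ".join(list(str(a + b)))
    return src, tgt

rng = random.Random(0)
labeled = [make_sum_example(rng) for _ in range(250)]        # small supervised set (size illustrative)
unlabeled = [make_sum_example(rng)[0] for _ in range(5000)]  # source-only pool for self-training
```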
Results
  • Appendix C: additional results on the toy sum dataset. We show error heat maps over the entire data space for the first two iterations (a sketch of such a heat-map computation appears after this list).
  • The model at the pseudo-training step is initialized from the model of the previous iteration, to clearly examine how the decodings change due to the injected noise.
  • As shown in Figure 5, in each iteration the pseudo-training step smooths the space, and the fine-tuning step benefits from this smoothing and greatly reduces the errors (panels: (a) baseline, (b) noisy ST at the PT step of iteration 1).
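
A hedged sketch of how an error heat map over the full 100×100 input space of the toy sum task could be computed; predict is a hypothetical callable that returns the model's decoded sum for a pair of integers:

```python
import numpy as np

def error_heatmap(predict):
    # errors[a, b] == 1 where the model decodes the wrong sum for (a, b).
    errors = np.zeros((100, 100), dtype=int)
    for a in range(100):
        for b in range(100):
            errors[a, b] = int(predict(a, b) != a + b)
    return errors

# Sanity check with a trivially correct predictor: the map is all zeros.
assert error_heatmap(lambda a, b: a + b).sum() == 0
```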
Conclusion
  • In this paper we revisit self-training for neural sequence generation and show that it can be an effective method to improve generalization when labeled data is scarce.
  • Through a comprehensive ablation analysis and synthetic experiments, we identify that the noise injected during self-training plays a critical role in its success due to its smoothing effect.
  • To encourage this behaviour, we explicitly perturb the input to obtain a new variant of self-training, dubbed noisy self-training (a sketch of such an input perturbation follows this list).
  • Experiments on machine translation and text summarization demonstrate the effectiveness of this approach in both low and high resource settings
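
A hedged sketch of one common form of input perturbation (random word drop plus a bounded local shuffle); the paper's exact noise function and hyperparameters may differ from this:

```python
import random

def perturb(tokens, drop_prob=0.1, shuffle_window=3, rng=random):
    # Randomly drop tokens, but never return an empty sequence.
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Bounded local shuffle: sort positions after adding a small random jitter.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

print(perturb("we revisit self-training for neural sequence generation".split()))
```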
Tables
  • Table 1: Test tokenized BLEU
  • Table 2: Ablation study on WMT100K data. For ST and noisy ST, we initialize the model with the baseline, and results are from a single iteration. Dropout is varied only in the PT step, while dropout is always applied in the FT step. Different decoding methods refer to the strategy used to create the pseudo target. At test time we use beam search decoding for all models
  • Table 3: Results on the toy sum dataset. For ST and noisy ST, smoothness (↓) and symmetric (↓) results are from the pseudo-training step, while test errors (↓) are from fine-tuning, all at the first iteration
  • Table 4: Results on two machine translation datasets. For WMT100K, we use the remaining 3.8M English and German sentences from the training data as unlabeled data for noisy ST and BT, respectively
  • Table 5: ROUGE scores on the Gigaword dataset. For the 100K setting we use the remaining 3.7M training examples as unlabeled instances for noisy ST and BT. In the 3.8M setting we use 4M unlabeled examples for noisy ST. Starred entries (∗) denote that the system uses a much larger dataset for pretraining
  • Table 6: Results on WMT100K data. All results are from a single iteration. “Parallel + real/fake target” denotes the noise process applied to parallel data but using the real/fake target in the “pseudo-training” step. “Mono + fake target” is the normal noisy self-training process described in previous sections
  • Table 7: Ablation analysis on the WMT100K dataset
Related work
  • Self-training belongs to a broader class of “pseudo-label” semi-supervised learning approaches. These approaches all learn from pseudo labels assigned to unlabeled data, differing mainly in how those labels are assigned. For instance, co-training (Blum & Mitchell, 1998) learns models on two independent feature sets of the same data, and assigns confident labels to unlabeled data from one of the models. Co-training reduces modeling bias by taking into account confidence scores from two models. In the same spirit, democratic co-training (Zhou & Goldman, 2004) and tri-training (Zhou & Li, 2005) train multiple models with different configurations on the same data feature set, and a subset of the models act as teachers for the others.

    Table 6 (BLEU on WMT100K; PT = pseudo-training step, FT = fine-tuning step):

    | method                                | PT   | FT   |
    | baseline                              | –    | 15.6 |
    | noisy ST, 100K mono + fake target     | 10.2 | 16.6 |
    | noisy ST, 3.8M mono + fake target     | 16.6 | 19.3 |
    | noisy ST, 100K parallel + real target | 6.7  | 11.3 |
    | noisy ST, 100K parallel + fake target | 10.4 | 16.0 |

    Another line of more recent work perturbs the input or feature space of the student's inputs as a data augmentation technique. Self-training with dropout, or noisy self-training, can be viewed as an instantiation of this idea. These approaches have been very successful on classification tasks (Rasmus et al, 2015; Miyato et al, 2017; Laine & Aila, 2017; Miyato et al, 2018; Xie et al, 2019), provided that a reasonable fraction of the predictions on unlabeled data (at least the high-confidence ones) are correct, but their effect on language generation tasks is largely unknown and poorly understood because the pseudo targets are often very different from the ground-truth labels. Recent work on sequence generation employs auxiliary decoders (Clark et al, 2018) when processing unlabeled data, overall showing rather limited gains.
Funding
  • Finds that the perturbation on the hidden states is critical for self-training to benefit from the pseudo-parallel data, which acts as a regularizer and forces the model to yield close predictions for similar unlabeled inputs
  • Proposes to inject noise to the input space, resulting in a “noisy” version of self-training
  • Finds that the decoding method to generate pseudo targets accounts for part of the improvement, but more importantly, the perturbation of hidden states – dropout – turns out to be a crucial ingredient to prevent self-training from falling into the same local optimum as the base model, and this is responsible for most of the gains
  • Finds that the separate training strategy with the whole pseudo-parallel dataset (i.e. S = {(x, fθ(x)) | x ∈ U}) produces better or equal performance for neural sequence generation while being simpler
Reference
  • Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100, 1998.
  • Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In Proceedings of AISTATS, 2005.
  • Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
  • Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. In Proceedings of EMNLP, 2018.
  • Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of EMNLP, 2018.
  • Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Proceedings of NeurIPS, 2005.
  • Francisco Guzman, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. The FLoRes evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. In Proceedings of EMNLP, 2019.
  • Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of EMNLP, 2009.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Proceedings of NeurIPS, 2014.
  • Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In Proceedings of ICLR, 2017.
  • Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, et al. Phrase-based & neural unsupervised machine translation. In Proceedings of EMNLP, 2018.
  • Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
  • Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.
  • David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of NAACL, 2006.
  • Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of EMNLP, 2016.
  • Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. In Proceedings of ICLR, 2017.
  • Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
  • Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL (Demo Track), 2019.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, 2002.
  • Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Proceedings of NeurIPS, 2015.
  • Roi Reichart and Ari Rappoport. Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In Proceedings of ACL, 2007.
  • Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, 2015.
  • H Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371, 1965.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of ACL, pp. 86–96, 2016.
  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of ACL, 2016.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of ICML, 2019.
  • Nicola Ueffing. Using monolingual source-language data to improve MT performance. In IWSLT, 2006.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NeurIPS, 2017.
  • Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
  • David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL, 1995.
  • Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In Proceedings of EMNLP, 2018.
  • Jiajun Zhang and Chengqing Zong. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, 2016.
  • Yan Zhou and Sally Goldman. Democratic co-learning. In 16th IEEE International Conference on Tools with Artificial Intelligence, pp. 594–602. IEEE, 2004.
  • Zhi-Hua Zhou and Ming Li. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge & Data Engineering, (11):1529–1541, 2005.
  • Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.