FRAGE: Frequency-Agnostic Word Representation.
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018): 1334-1345
Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space.
- Word embeddings, which are distributed and continuous vector representations for word tokens, have been one of the basic building blocks for many neural network-based models used in natural language processing (NLP) tasks, such as language modeling [18, 16], text classification [24, 7] and machine translation [4, 5, 40, 38, 11].
- Adversarial training has been successfully applied to NLP tasks [6, 22, 21]. [6, 22] introduce an additional discriminator to differentiate the semantics learned from different languages in non-parallel bilingual data. [21] develops a discriminator to classify whether a sentence is created by a human or generated by a model.
- For a given natural language processing task, in addition to minimizing the task-specific loss by optimizing the task-specific parameters together with word embeddings, we introduce another discriminator, which takes a word embedding as input and classifies whether it corresponds to a popular or a rare word.
- While word embeddings play an important role in neural network-based models in natural language processing and achieve great success, one technical challenge is that the embeddings of rare words are difficult to train due to their low frequency of occurrence. One line of work splits each word into sub-word units, an approach widely used in neural machine translation.
- We find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space
- A simple yet effective adversarial training method that improves model performance, verified on a wide range of tasks.
- The authors present the method to improve word representations.
- The authors hope the learned word embeddings not only minimize the task-specific training loss but also fool the discriminator.
- The frequency information is removed from the embedding and the authors call the method frequency-agnostic word embedding (FRAGE).
- The authors introduce three types of notation: embeddings, task-specific parameters/losses, and discriminator parameters/losses.
- Fragment of a results table: Transformer Base with FRAGE reaches 28.36 BLEU; other listed baselines include Transformer Base, ConvS2S+Risk, DeepConv, and ConvS2S.
- The authors provide the experimental results of all tasks. For simplicity, the authors use “with FRAGE” to denote the proposed method in the tables.
Word Similarity: The results on three word similarity tasks are listed in Table 1.
- The method outperforms the baseline by about 5.4 points on the rare-word dataset RW.
- This result shows that the method improves word representations, especially for rare words.
- The authors have strong evidence that the current phenomena are problematic. First, according to the study, in both tasks more than half of the rare words are nouns, e.g., company names and city names.
- It is not good to use such word embeddings in semantic understanding tasks, e.g., text classification, language modeling, language understanding, and translation. In this paper, the authors find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space.
- The authors will study more applications that exhibit similar problems, even beyond NLP.
- Table1: Results on three word similarity datasets
- Table2: Perplexity on validation and test sets of Penn Treebank and WikiText2. The smaller the perplexity, the better the result. Baseline results are obtained from [25, 41]. “Paras” denotes the number of model parameters
- Table3: BLEU scores on the test sets of the WMT2014 English-German and IWSLT German-English tasks
- Table4: Accuracy on test sets of AG’s news corpus (AG’s), IMDB movie review dataset (IMDB) and 20 Newsgroups (20NG) for text classification
- Table5: Case study for the original model and our method. Rare words are marked by “*”. For each word, we list its model-predicted neighbors. Moreover, we also show the ranking positions of the semantic neighbors based on cosine similarity. As we can see, the ranking positions of the semantic neighbors are very low for the original model
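The case study in Table 5 ranks each word's semantic neighbors by cosine similarity between embeddings. A minimal sketch of that ranking, using a toy embedding matrix with hypothetical words (the vocabulary, dimensionality, and noise level are our own illustrative choices; "*" marks the rare word as in the table):

```python
import numpy as np

# Toy vocabulary; "citizen*" stands for a rare word whose semantic
# neighbor is the frequent word "citizens".
vocab = ["citizens", "citizen*", "people", "residents", "banana"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(vocab), 16))    # toy embedding matrix
E[1] = E[0] + 0.1 * rng.normal(size=16)  # place the rare word near its neighbor

def cosine_rank(E, query, target):
    """Rank of `target` among `query`'s neighbors by cosine similarity (1 = nearest)."""
    q = E[query]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    sims[query] = -np.inf                # exclude the query word itself
    order = np.argsort(-sims)
    return int(np.where(order == target)[0][0]) + 1

rank = cosine_rank(E, vocab.index("citizen*"), vocab.index("citizens"))
```

In the paper's observation, this rank is high (bad) for rare words under the original model and improves under FRAGE; in the toy above the rare word has been placed next to its neighbor, so the rank comes out as 1.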
- This work is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026) and BJNSF (L172037) and a grant from Microsoft Research Asia
- R. Al-Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
- M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
- D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
- A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.
- S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato. Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956, 2017.
- Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
- J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.
- J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
- C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T. Liu. FRAGE: frequency-agnostic word representation. CoRR, abs/1809.06858, 2018.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. CoRR, abs/1612.04426, 2016.
- E. Hoffer, I. Hubara, and D. Soudry. Fix your classifier: the marginal value of training the last weight layer. ICLR, 2018.
- R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.
- Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.
- B. Krause, E. Kahembwe, I. Murray, and S. Renals. Dynamic evaluation of neural sequence models. CoRR, abs/1709.07432, 2017.
- S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273, 2015.
- A. M. Lamb, A. G. A. P. GOYAL, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.
- G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
- R. R. Larson. Introduction to information retrieval. Journal of the American Society for Information Science and Technology, 61(4):852–853, 2010.
- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
- S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
- T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048, 2010.
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417, 2017.
- M. Ott, M. Auli, D. Granger, and M. Ranzato. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047, 2018.
- J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.
- A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
- R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2962–2971, 2017.
- A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
- Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and L. Tie-Yan. Dual transfer learning for neural machine translation with marginal distribution regularization.
- Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR, abs/1711.03953, 2017.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.