FRAGE: Frequency-Agnostic Word Representation.

Advances in Neural Information Processing Systems 31 (NIPS 2018): 1334-1345

Abstract

Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space. …

Introduction
  • Word embeddings, which are distributed and continuous vector representations for word tokens, have been one of the basic building blocks for many neural network-based models used in natural language processing (NLP) tasks, such as language modeling [18, 16], text classification [24, 7] and machine translation [4, 5, 40, 38, 11].
  • Adversarial training has been successfully applied to NLP tasks [6, 22, 21]. [6, 22] introduce an additional discriminator to differentiate the semantics learned from different languages in non-parallel bilingual data. [21] develops a discriminator to classify whether a sentence is created by a human or generated by a model.
Highlights
  • Word embeddings, which are distributed and continuous vector representations for word tokens, have been one of the basic building blocks for many neural network-based models used in natural language processing (NLP) tasks, such as language modeling [18, 16], text classification [24, 7] and machine translation [4, 5, 40, 38, 11]
  • For a given natural language processing task, in addition to minimizing the task-specific loss by optimizing the task-specific parameters together with word embeddings, we introduce an additional discriminator, which takes a word embedding as input and classifies whether it is a popular or rare word.
  • While word embeddings play an important role in neural network-based models in natural language processing and achieve great success, one technical challenge is that the embeddings of rare words are difficult to train due to their low frequency of occurrence. [35] develops a novel way to split each word into sub-word units, which is widely used in neural machine translation.
  • Adversarial training has been successfully applied to natural language processing tasks [6, 22, 21]. [6, 22] introduce an additional discriminator to differentiate the semantics learned from different languages in non-parallel bilingual data. [21] develops a discriminator to classify whether a sentence is created by a human or generated by a model.
  • We find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space
  • We propose a simple yet effective adversarial training method to improve model performance, which is verified on a wide range of tasks; a minimal sketch of the idea follows this list.
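The sketch below illustrates this adversarial idea in PyTorch. It is a hypothetical, minimal example, not the authors' released implementation: names such as task_loss_fn, is_popular, and lambda_adv are placeholders, and the popular/rare labels are assumed to come from corpus word counts.

    # Minimal sketch of frequency-agnostic (FRAGE-style) adversarial training.
    # Hypothetical code, not the authors' implementation.
    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10000, 256
    embedding = nn.Embedding(vocab_size, emb_dim)

    # Discriminator: classifies an embedding as popular (1) or rare (0).
    discriminator = nn.Sequential(
        nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 1)
    )
    bce = nn.BCEWithLogitsLoss()
    opt_emb = torch.optim.Adam(embedding.parameters(), lr=1e-3)
    opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
    lambda_adv = 0.1  # weight of the adversarial term (placeholder value)

    def train_step(word_ids, is_popular, task_loss_fn):
        """is_popular: float tensor, 1.0 for high-frequency words, 0.0 for rare."""
        emb = embedding(word_ids)  # (batch, emb_dim)

        # 1) Train the discriminator to separate popular from rare embeddings.
        opt_disc.zero_grad()
        d_loss = bce(discriminator(emb.detach()).squeeze(-1), is_popular)
        d_loss.backward()
        opt_disc.step()

        # 2) Train the embeddings (and, in practice, the task model as well) to
        #    minimize the task loss while fooling the discriminator, i.e.
        #    maximizing its loss so that frequency cues are removed.
        opt_emb.zero_grad()
        adv_loss = bce(discriminator(emb).squeeze(-1), is_popular)
        total = task_loss_fn(emb) - lambda_adv * adv_loss
        total.backward()
        opt_emb.step()
        return d_loss.item(), total.item()

In the paper, the popular/rare split is derived from word frequency in the training data (the most frequent words are treated as popular); the threshold and the value of lambda_adv above are placeholders only.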
Methods
  • The authors present the method to improve word representations.
  • The authors hope the learned word embeddings will not only minimize the task-specific training loss but also fool the discriminator; this objective is formalized after this list.
  • The frequency information is removed from the embedding and the authors call the method frequency-agnostic word embedding (FRAGE).
  • The authors develop three types of notations: embeddings, task-specific parameters/loss, and discriminator parameters/loss.
  • The neural machine translation comparisons (Table 3) include Transformer Base [38], ConvS2S+Risk [8], DeepConv [10], and ConvS2S [11]; Transformer Base with FRAGE reaches 28.36 BLEU.
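As a rough formalization of this objective (notation simplified here and possibly differing from the paper's exact formulation): let L_T(θ^emb, θ^T) denote the task-specific loss and L_D(θ^emb, θ^D) the discriminator's loss for classifying embeddings as popular or rare. The embeddings and task-specific parameters play a minimax game against the discriminator:

    \min_{\theta^{emb},\,\theta^{T}} \; \max_{\theta^{D}} \; L_T(\theta^{emb},\theta^{T}) \;-\; \lambda\, L_D(\theta^{emb},\theta^{D})

where λ is a hyperparameter balancing the task loss against removing frequency information from the embeddings.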
Results
  • The authors provide the experimental results of all tasks. For simplicity, the authors use “with FRAGE” to denote the proposed method in the tables.
  • Word Similarity: The results on three word similarity tasks are listed in Table 1.
  • The authors outperform the baseline by about 5.4 points on the rare word dataset RW.
  • This result shows that the method improves the representation of words, especially rare words.
Conclusion
  • The authors have strong evidence that the current phenomena are problematic. First, according to the study, in both tasks, more than half of the rare words are nouns, e.g., company names and city names.
  • It is undesirable to use such word embeddings in semantic understanding tasks, e.g., text classification, language modeling, language understanding, and translation. In this paper, the authors find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space.
  • The authors will study more applications that have a similar problem, even beyond NLP.
Tables
  • Table1: Results on three word similarity datasets
  • Table2: Perplexity on validation and test sets on Penn Treebank and WikiText2. The smaller the perplexity, the better the result. Baseline results are obtained from [25, 41]. “Paras” denotes the number of model parameters
  • Table3: BLEU scores on test set on WMT2014 English-German and IWSLT German-English tasks
  • Table4: Accuracy on test sets of AG’s news corpus (AG’s), IMDB movie review dataset (IMDB) and 20 Newsgroups (20NG) for text classification
  • Table5: Case study for the original model and our method. Rare words are marked by “*”. For each word, we list its model-predicted neighbors. Moreover, we also show the ranking positions of the semantic neighbors based on cosine similarity. As we can see, the ranking positions of the semantic neighbors are very low for the original model
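As an illustration of the neighbor ranking reported in Table 5 (a generic sketch, not the authors' evaluation script): neighbors of a query word are ordered by cosine similarity between embedding vectors, and the rank of a known semantic neighbor is read off from that ordering.

    # Generic cosine-similarity neighbor ranking (illustrative only).
    import numpy as np

    def neighbor_rank(emb, query_idx, neighbor_idx):
        """Rank of `neighbor_idx` among all words sorted by cosine similarity
        to `query_idx` (rank 1 = most similar; the query itself is excluded)."""
        normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = normed @ normed[query_idx]   # cosine similarity to the query word
        sims[query_idx] = -np.inf           # exclude the query word itself
        order = np.argsort(-sims)           # indices from most to least similar
        return int(np.where(order == neighbor_idx)[0][0]) + 1

    # Example with random embeddings (real embeddings come from the trained model):
    emb = np.random.randn(1000, 64)
    print(neighbor_rank(emb, query_idx=3, neighbor_idx=42))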
Funding
  • This work is supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), BJNSF (L172037), and a grant from Microsoft Research Asia.
References
  • R. Al-Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word representations for multilingual nlp. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
  • M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.
  • D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
  • A. M. Dai and Q. V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.
  • S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato. Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956, 2017.
  • Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189, 2015.
  • J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
  • C. Gong, D. He, X. Tan, T. Qin, L. Wang, and T. Liu. FRAGE: frequency-agnostic word representation. CoRR, abs/1809.06858, 2018.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. CoRR, abs/1612.04426, 2016.
  • E. Hoffer, I. Hubara, and D. Soudry. Fix your classifier: the marginal value of training the last weight layer. ICLR, 2018.
  • R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
  • N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.
  • Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.
  • B. Krause, E. Kahembwe, I. Murray, and S. Renals. Dynamic evaluation of neural sequence models. CoRR, abs/1709.07432, 2017.
  • S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273, 2015.
  • A. M. Lamb, A. G. A. P. GOYAL, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.
  • G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
  • R. R. Larson. Introduction to information retrieval. Journal of the American Society for Information Science and Technology, 61(4):852–853, 2010.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics, 2011.
  • S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. CoRR, abs/1708.02182, 2017.
  • S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
  • T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048, 2010.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417, 2017.
  • J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. CoRR, abs/1702.01417, 2017.
  • M. Ott, M. Auli, D. Granger, and M. Ranzato. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047, 2018.
  • J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, 2014.
  • A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2962–2971, 2017.
  • A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
  • Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T.-Y. Liu. Dual transfer learning for neural machine translation with marginal distribution regularization.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. CoRR, abs/1711.03953, 2017.
  • J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.