Pre-Trained Multi-View Word Embedding Using Two-Side Neural Network

AAAI, pp. 1982-1988, 2014.

Keywords:
multiple linear regression; web search; Wikipedia; multilinear regression

Introduction
  • Word embedding is a continuous-valued representation of a word.
  • A good word embedding is expressive and effective, since it can represent a huge number of possible inputs with only a small number of variables and helps tackle the curse of dimensionality.
  • By representing each word with the learned continuous-valued variables, semantically related words end up close to each other (see the nearest-neighbor sketch after this list).
  • The effectiveness of word embedding has been investigated in the literature (Bengio, Courville, and Vincent 2013).
  • Word embedding techniques have been extended to embed queries, documents, phrases, entities, etc. (Huang et al. 2013; Mikolov et al. 2013b), and can play a critical role in industrial applications such as large-scale web search and knowledge mining.
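  As an illustration of the last point, the sketch below retrieves the nearest neighbors of a word by cosine similarity in an embedding table (the kind of query behind Table 1). It is a minimal sketch only; `vocab` and `emb` are hypothetical placeholders for a vocabulary list and its embedding matrix.

    import numpy as np

    def nearest_words(query, vocab, emb, k=10):
        """Return the k words whose embeddings are closest to `query` by
        cosine similarity. `vocab` is a list of words and `emb` is a
        (len(vocab), d) array whose i-th row embeds vocab[i]."""
        idx = vocab.index(query)
        # Normalize rows so a dot product equals cosine similarity.
        normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = normed @ normed[idx]
        # Drop the query word itself, then keep the k most similar words.
        order = [i for i in np.argsort(-sims) if i != idx][:k]
        return [(vocab[i], float(sims[i])) for i in order]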
Highlights
  • Word embedding is a continuous-valued representation of a word
  • Word embedding techniques have been extended to embed queries, documents, phrases, entities, etc. (Huang et al. 2013; Mikolov et al. 2013b), and can play a critical role in industrial applications such as large-scale web search and knowledge mining
  • We show the results of combining these three word embeddings for specific tasks using the proposed multi-view word embedding framework, compared with several baselines
  • This paper presents a two-side neural network architecture that can integrate multiple word embeddings for very different application tasks, such as search ranking and semantic word relatedness measuring
  • The network can be fine-tuned for the tasks of interest and outputs a unified word embedding for each of them (a minimal sketch of such a two-side network follows this list)
  • The concatenation and linear regression strategies are not robust for combining word embeddings and may fail on some tasks
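  The paper's exact layer sizes and training objective are not reproduced in this summary, so the following is only a minimal sketch of a two-side (siamese-style) network in which both sides share a projection that maps the concatenation of several pre-trained embeddings to one unified embedding; the class and parameter names are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoSideNet(nn.Module):
        """Both sides share one projection: concatenated pre-trained
        embeddings -> hidden layer -> unified embedding."""

        def __init__(self, view_dims, unified_dim=100, hidden_dim=200):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(sum(view_dims), hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, unified_dim),
            )

        def embed(self, views):
            # views: list of (batch, d_i) tensors, one per pre-trained
            # embedding (e.g. one from Wikipedia, one from click-through data).
            return self.proj(torch.cat(views, dim=-1))

        def forward(self, left_views, right_views):
            # Score a pair (query word vs. candidate word, or query vs. title)
            # by the cosine similarity of the two unified embeddings.
            return F.cosine_similarity(self.embed(left_views),
                                       self.embed(right_views), dim=-1)

  Fine-tuning for a task then amounts to minimizing a task-specific loss on these scores, e.g. a pairwise ranking loss on click-through data for search ranking or a regression loss against human ratings for word relatedness, after which the unified embedding of a word is read off from the shared projection.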
Methods
  • The proposed method is competitive in this task and significantly outperforms the other baselines; the REG strategy performs worse than CBOW_click and CBOW_wiki, and fails to combine the different word embeddings.
  • This again indicates that the simple linear combination strategy is unreliable for word embedding combination (a sketch of such a linear combination follows this list).
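  The REG baseline criticized above is, as far as this summary indicates, a linear combination of the individual embeddings; the exact formulation is an assumption here. A minimal sketch of such a combination, fitted by least squares over the concatenated per-view embeddings, is given below with hypothetical variable names.

    import numpy as np

    def fit_linear_combination(view_a, view_b, targets):
        """Fit W so that [view_a | view_b] @ W approximates `targets` in the
        least-squares sense (a simple linear-regression combination).
        view_a, view_b: (n_words, d_a) and (n_words, d_b) embedding matrices;
        targets: (n_words, d_out) task-specific supervision."""
        X = np.hstack([view_a, view_b])
        W, _, _, _ = np.linalg.lstsq(X, targets, rcond=None)
        return W  # the combined embedding of word i is then X[i] @ W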
Conclusion
  • This paper presents a two-side neural network architecture that can integrate multiple word embeddings for very different application tasks, such as search ranking and semantic word relatedness measuring.
  • The input word embeddings are pre-trained by adapting an existing word embedding algorithm to different data sources.
  • The concatenation and linear regression strategies are not robust for combining word embeddings and may fail on some tasks.
  • Future work may introduce more word embedding algorithms for adaptation and combination, and include more application tasks for verification.
Tables
  • Table1: Word embeddings in the lookup table trained by different data sources. Each column is the queried word followed by its 10 most similar words in the dictionary
  • Table2: NDCG performance of different methods for search ranking (a sketch of the NDCG metric follows this list)
  • Table3: Comparison of different methods for word similarity measuring
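  Table 2 reports NDCG, the graded-relevance ranking metric of Jarvelin and Kekalainen (2000). One common formulation of NDCG@k is sketched below; the exact gain and discount convention used in the paper is an assumption here.

    import numpy as np

    def ndcg_at_k(relevances, k):
        """NDCG@k for one query. `relevances` are the graded relevance labels
        of the returned documents, listed in ranked order."""
        rel = np.asarray(relevances, dtype=float)[:k]
        # Gain 2^rel - 1 with logarithmic position discount.
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
        dcg = np.sum((2.0 ** rel - 1.0) * discounts)
        # Normalize by the DCG of the ideal (best possible) ordering.
        ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
        idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
        return dcg / idcg if idcg > 0 else 0.0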
Related work
  • Our work is a combination of the word embeddings pre-trained from multiple views.

    Word embedding
    Traditional one-hot word representation, in which the dimensionality of the word vector equals the size of the dictionary, suffers from the data sparsity problem. Therefore, many researchers focus on representing a word with a continuous-valued low-dimensional feature vector (Dumais et al. 1988; Brown et al. 1992; Bengio et al. 2003). Word embedding (Bengio et al. 2003; Collobert and Weston 2008; Mnih and Hinton 2008) is one of the most popular word representations of this type. We refer to (Turian, Ratinov, and Bengio 2010) for a summary of popular word representation works. Recently, two very efficient models, continuous bag-of-words (CBOW) and continuous skip-gram (Skip-gram), were proposed in (Mikolov et al. 2013a). High-quality word embeddings can be learned with these two models, and the training process was further accelerated in (Mikolov et al. 2013b) by sub-sampling the frequent words. Due to the efficiency and effectiveness of these two models, we use one of them, the CBOW model, as our baseline word embedding algorithm. We will show how to adapt it for training on various data sources, and then combine the learned word embeddings for different applications.
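  Since the CBOW model is the baseline embedding algorithm, a minimal sketch of training one CBOW embedding per data source is given below. It assumes gensim >= 4.0 and hypothetical pre-tokenized corpora (wikipedia_sentences, clickthrough_sentences); it is not the paper's own adapted training procedure.

    from gensim.models import Word2Vec

    def train_cbow(sentences, dim=100):
        """Train a CBOW embedding on one tokenized corpus (one 'view')."""
        # sg=0 selects CBOW; `sample` sub-samples frequent words, as in
        # (Mikolov et al. 2013b); `negative` enables negative sampling.
        model = Word2Vec(sentences, vector_size=dim, window=5, sg=0,
                         negative=5, sample=1e-5, min_count=5, workers=4)
        return model.wv  # word -> vector lookup table for this view

    # Hypothetical usage: one embedding view per data source.
    # wiki_view = train_cbow(wikipedia_sentences)
    # click_view = train_cbow(clickthrough_sentences)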
Funding
  • This work is partially supported by NBRPC 2011CB302400, NSFC 60975014, 61121002, JCYJ20120614152136201, NSFB 4102024
Reference
  • Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3:1137–1155.
  • Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.
  • Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.
  • Brown, P. F.; Desouza, P. V.; Mercer, R. L.; Pietra, V. J. D.; and Lai, J. C. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4):467–479.
  • Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 160–167.
  • Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.
  • Dhillon, P.; Foster, D. P.; and Ungar, L. H. 2011. Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems, 199–207.
  • Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; and Harshman, R. 1988. Using latent semantic analysis to improve access to textual information. In SIGCHI Conference on Human Factors in Computing Systems, 281–285.
  • Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; and Ruppin, E. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20(1):116–131.
  • Gabrilovich, E., and Markovitch, S. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In International Joint Conference on Artificial Intelligence, 1606–1611.
  • Huang, P.-S.; He, X.; Gao, J.; Deng, L.; Acero, A.; and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information & Knowledge Management, 2333–2338.
  • Jarvelin, K., and Kekalainen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR Conference on Research and Development in Information Retrieval, 41–48.
  • Krasnopolsky, V. M., and Lin, Y. 2012. A neural network nonlinear multimodel ensemble to improve precipitation forecasts over continental US. Advances in Meteorology 2012, doi:10.1155/2012/649450.
  • Kung, S.-Y., and Hwang, J.-N. 1998. Neural networks for intelligent multimedia processing. Proceedings of the IEEE 86(6):1244–1272.
  • Lin, D., and Wu, X. 2009. Phrase clustering for discriminative learning. In Joint Conference of the Annual Meeting of the ACL and the International Joint Conference on Natural Language Processing, 1030–1038.
  • Luo, Y.; Tao, D.; Xu, C.; Li, D.; and Xu, C. 2013a. Vector-valued multi-view semi-supervised learning for multi-label image classification. In AAAI Conference on Artificial Intelligence, 647–653.
  • Luo, Y.; Tao, D.; Xu, C.; Liu, H.; and Wen, Y. 2013b. Multiview vector-valued manifold regularization for multilabel image classification. IEEE Transactions on Neural Networks and Learning Systems 24(5):709–722.
  • Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop.
  • Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
  • Mnih, A., and Hinton, G. 2007. Three new graphical models for statistical language modelling. In International Conference on Machine Learning, 641–648.
  • Mnih, A., and Hinton, G. E. 2008. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, 1081–1088.
  • Montavon, G.; Orr, G. B.; and Muller, K.-R. 2012. Neural Networks: Tricks of the Trade (2nd edition). Springer.
  • Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. 2011. Multimodal deep learning. In International Conference on Machine Learning, 689–696.
  • Srivastava, N., and Salakhutdinov, R. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, 2231–2239.
  • Strube, M., and Ponzetto, S. P. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI Conference on Artificial Intelligence, 1419–1424.
  • Turian, J.; Ratinov, L.; and Bengio, Y. 2010. Word representations: a simple and general method for semi-supervised learning. In Annual Meeting of the Association for Computational Linguistics, 384–394.