Word Alignment Modeling with Context Dependent Deep Neural Network.
ACL, pp. 166–175, 2013
In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-bas…
- In recent years, research communities have seen a strong resurgence of interest in modeling with deep neural networks.
- This trending topic, usually referred to under the name Deep Learning, was started by ground-breaking papers such as (Hinton et al., 2006), in which innovative training procedures for deep structures were proposed.
- DNNs did not achieve the expected success until 2006, when researchers discovered a proper way to initialize and train deep architectures, which consists of two phases: layer-wise unsupervised pre-training and supervised fine-tuning.
- For speech recognition, (Dahl et al., 2012) proposed a context-dependent deep neural network for large-vocabulary tasks, which achieved a 16.0% relative error reduction.
- Inspired by these successful previous works, we propose a new DNN-based word alignment method that exploits contextual and semantic similarities between words.
- We introduce the details of leveraging DNNs for word alignment, including the details of our network structure in Section 4 and the training method in Section 5.
- We explore applying deep neural networks to the word alignment task.
- Our model integrates a multi-layer neural network into an HMM-like framework, where a context-dependent lexical translation score is computed by the neural network, and distortion is modeled by a simple jump-distance scheme.
- Our current model uses rather simple distortions; it might be helpful to use a more sophisticated model such as ITG (Wu, 1997), which can be modeled by recursive neural networks (Socher et al., 2011).
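The HMM-like decoding described in the notes above can be sketched as a small Viterbi search, where the emission score per link comes from a lexical scorer and transitions use a jump-distance penalty. This is a minimal illustration, not the paper's implementation: the toy score table below stands in for the context-dependent neural lexical scorer, and all words, weights, and function names are assumptions for the sketch.

```python
import math

def lexical_score(src_word, tgt_word):
    # Hypothetical stand-in for the neural lexical scorer: in the paper this
    # is a multi-layer network over embedding windows; here a fixed table
    # keeps the sketch self-contained.
    table = {("wo", "i"): 2.0, ("xihuan", "like"): 2.0, ("ni", "you"): 2.0}
    return table.get((src_word, tgt_word), -1.0)

def jump_score(prev_j, j, c=0.5):
    # Simple jump-distance distortion: penalize deviation from a monotone step.
    return -c * abs(j - prev_j - 1)

def viterbi_align(src, tgt):
    """Best alignment j_1..j_m of each target word to a source position."""
    n, m = len(src), len(tgt)
    best = [[-math.inf] * n for _ in range(m)]  # best[i][j]: tgt[i] -> src[j]
    back = [[0] * n for _ in range(m)]
    for j in range(n):
        best[0][j] = lexical_score(src[j], tgt[0])
    for i in range(1, m):
        for j in range(n):
            emit = lexical_score(src[j], tgt[i])
            for pj in range(n):
                s = best[i - 1][pj] + jump_score(pj, j) + emit
                if s > best[i][j]:
                    best[i][j] = s
                    back[i][j] = pj
    # Backtrack from the best final state.
    j = max(range(n), key=lambda j: best[m - 1][j])
    align = [j]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        align.append(j)
    align.reverse()
    return align
```

For example, `viterbi_align(["wo", "xihuan", "ni"], ["i", "like", "you"])` returns the monotone alignment `[0, 1, 2]` under the toy scores.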
- The authors use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009), which contains 491 sentence pairs, as the test set.
- The monolingual corpora used to pre-train word embeddings are crawled from the web, amounting to about 1.1 billion unique sentences for English and about 300 million unique sentences for Chinese.
- Since the classic HMM, IBM Model 4, and the proposed model are all uni-directional, the authors use the standard grow-diag-final heuristic to generate bi-directional results for all models.
- The authors' model is discriminatively trained on a bilingual corpus, while huge monolingual data is used to pre-train word embeddings.
- Experiments on a large-scale Chinese-to-English task show that the proposed method produces better word alignment results, compared with both the classic HMM model and IBM Model 4.
- The authors will investigate more settings of different hyper-parameters in the model.
- Table 1: Word alignment results. The first and third rows show baseline results obtained by the classic HMM and IBM Model 4; the second and fourth rows show results of the proposed model trained from HMM and IBM Model 4, respectively.
- Table 2: Nearest neighbors of several words according to their embedding distance. LM shows neighbors of word embeddings trained by the monolingual language model method; WA shows neighbors of word embeddings trained by our word alignment model.
- Table 3: Effect of different numbers of hidden layers. Two hidden layers outperform one hidden layer, while three hidden layers do not bring further improvement.
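The grow-diag-final symmetrization mentioned in the notes above combines the two one-directional alignments into a single bi-directional one. A simplified sketch follows, assuming links are given as sets of `(source, target)` index pairs; the function name and the exact growing order are illustrative, and the real heuristic (as in Moses/GIZA++ tooling) has a few more refinements.

```python
def grow_diag_final(s2t, t2s):
    """Simplified grow-diag-final: symmetrize two one-directional alignments."""
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    union = s2t | t2s
    alignment = set(s2t & t2s)  # start from the high-precision intersection

    def uncovered(i, j):
        # A candidate link may be added if its source or its target word
        # is not yet covered by any accepted link.
        return all(a != i for a, _ in alignment) or all(b != j for _, b in alignment)

    # Grow: repeatedly add union links adjacent (incl. diagonally) to accepted links.
    changed = True
    while changed:
        changed = False
        for (i, j) in sorted(alignment):
            for (di, dj) in neighbors:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment and uncovered(*cand):
                    alignment.add(cand)
                    changed = True
    # Final: add any remaining union link that still touches an uncovered word.
    for (i, j) in sorted(union - alignment):
        if uncovered(i, j):
            alignment.add((i, j))
    return alignment
```

For instance, with `s2t = {(0, 0), (1, 1)}` and `t2s = {(0, 0), (2, 2)}`, the intersection `{(0, 0)}` is grown diagonally to recover all three links.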
DNN with unsupervised pre-training was first introduced by (Hinton et al., 2006) for the MNIST digit image classification problem, in which RBMs were used as the layer-wise pre-trainer. The layer-wise pre-training phase finds a better local optimum for the multi-layer network, thus leading to improved performance. (Krizhevsky et al., 2012) applied DNNs to the object recognition task (ImageNet dataset), bringing the state-of-the-art error rate down from 26.1% to 15.3%. (Seide et al., 2011) and (Dahl et al., 2012) applied Context-Dependent Deep Neural Networks with HMMs (CD-DNN-HMM) to the speech recognition task, significantly outperforming traditional models.
Most methods using DNNs in NLP start with a word embedding phase, which maps words into fixed-length, real-valued vectors. (Bengio et al., 2006) proposed to use a multi-layer neural network for the language modeling task. (Collobert et al., 2011) applied DNNs to several NLP tasks, such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling, and syntactic parsing, where they obtained results similar to or even better than the state of the art on these tasks. (Niehues and Waibel, 2012) show that machine translation results can be improved by combining a neural language model with a traditional n-gram language model. (Son et al., 2012) improve the translation quality of an n-gram translation model by using a bilingual neural language model. (Titov et al., 2012) learn context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
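The embedding phase described above can be sketched as a lookup table whose per-position vectors are concatenated into the network's input layer, in the Collobert-style window setup. The toy dimensionality, random initialization, and `<pad>` token below are assumptions for illustration, not the paper's actual configuration.

```python
import random

def build_embeddings(vocab, dim=3, seed=0):
    # Toy lookup table: each word maps to a fixed-length, real-valued vector.
    rng = random.Random(seed)
    emb = {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}
    emb["<pad>"] = [0.0] * dim  # padding for window positions off the sentence
    return emb

def window_input(sentence, pos, emb, win=1):
    """Concatenate embeddings of the words in a window around `pos`;
    this concatenation forms the input layer of the network."""
    dim = len(next(iter(emb.values())))
    vec = []
    for k in range(pos - win, pos + win + 1):
        word = sentence[k] if 0 <= k < len(sentence) else "<pad>"
        vec.extend(emb.get(word, [0.0] * dim))
    return vec
```

With `win=1` and `dim=3`, each position yields a 9-dimensional input vector; positions that fall outside the sentence contribute the zero padding vector.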
For related work on word alignment, the most popular methods are based on generative models such as the IBM Models (Brown et al., 1993) and the HMM (Vogel et al., 1996). Discriminative approaches have also been proposed that use hand-crafted features to improve word alignment. Among them, (Liu et al., 2010) proposed to use phrase and rule pairs to model context information in a log-linear framework. Unlike previous discriminative methods, in this work we do not resort to any hand-crafted features, but use a DNN to induce "features" from raw words.
Study subjects and analysis
sentence pairs: 491
We conduct our experiment on the Chinese-to-English word alignment task. We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009), which contains 491 sentence pairs, as the test set. We adapt the segmentation on the Chinese side to fit our word segmentation standard.
- Yoshua Bengio, Holger Schwenk, Jean-Sebastien Senecal, Frederic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137–186.
- Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153.
- Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
- Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, and Yann LeCun. 2010. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing Systems, pages 1090–1098.
- Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.
- Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114.
- Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Yann LeCun. 1985. A learning scheme for asymmetric threshold networks. Proceedings of Cognitiva, 85:599–604.
- Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. 2007. Efficient sparse coding algorithms. Advances in neural information processing systems, 19:801.
- Shujie Liu, Chi-Ho Li, and Ming Zhou. 2010. Discriminative pruning for discriminative itg alignment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL, volume 10, pages 316–324.
- Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. 2007. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20:1185–1192.
- Robert C Moore. 2005. A discriminative framework for bilingual word alignment. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 81–88. Association for Computational Linguistics.
- Jan Niehues and Alex Waibel. 2012. Continuous space language models using restricted Boltzmann machines. In Proceedings of the Ninth International Workshop on Spoken Language Translation (IWSLT).
- Franz Josef Och and Hermann Ney. 2000. GIZA++: Training of statistical translation models.
- Frank Seide, Gang Li, and Dong Yu. 2011. Conversational speech transcription using context-dependent deep neural networks. In Proc. Interspeech, pages 437–440.
- Noah A Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354–362. Association for Computational Linguistics.
- Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 129–136.
- Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.
- Le Hai Son, Alexandre Allauzen, and Francois Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48. Association for Computational Linguistics.
- Ivan Titov, Alexandre Klementiev, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words.
- Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.
- Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. Hmm-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics-Volume 2, pages 836– 841. Association for Computational Linguistics.
- Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational linguistics, 23(3):377–403.