Fast and Robust Neural Network Joint Models for Statistical Machine Translation.

ACL, pp. 1370–1380, 2014. (Best Paper of ACL 2014)

Abstract

Recent work has shown success in using neural network language models (NNLMs) as features in MT systems. Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. Our model is purely lexicalized and can be integrated into any MT decoder. We also present several variations of this model which provide significant speedups with only a very small degradation in BLEU.

Introduction
  • Recent work has shown success in using neural network language models (NNLMs) as features in MT systems.
  • The authors present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window (a minimal sketch of such a scorer follows this list).
  • On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
  • The NNJM features produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang's (2007) original Hiero implementation.
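To make the formulation above concrete, here is a minimal sketch of an NNJM-style scorer: a plain feed-forward network whose input is the concatenation of a 3-word target history and an 11-word source window centered on the affiliated source word. This is not the authors' code; the layer sizes follow the "default" architecture in Table 6 below, while the helper names (`source_window`, the padding id, the `affiliation` handling) and the numpy implementation are illustrative assumptions.

```python
import numpy as np

# Minimal NNJM sketch (not the authors' implementation): a feed-forward network
# scoring P(target word | 3-word target history, 11-word source window).
# Sizes follow the "default" row of Table 6 (192-dim embeddings, two 512-unit
# hidden layers, 32k vocabulary); everything else is illustrative.
EMB, HID, VOCAB = 192, 512, 32000
N_CONTEXT = 3 + 11                      # target history + source window

rng = np.random.default_rng(0)
E  = rng.normal(0, 0.1, (VOCAB, EMB))          # word embeddings
W1 = rng.normal(0, 0.1, (N_CONTEXT * EMB, HID))
W2 = rng.normal(0, 0.1, (HID, HID))
Wo = rng.normal(0, 0.1, (HID, VOCAB))
b1, b2, bo = np.zeros(HID), np.zeros(HID), np.zeros(VOCAB)

def source_window(source_ids, affiliation, size=11):
    """Take `size` source words centered on the affiliated source index,
    padding with id 0 (an assumed pad symbol) at sentence edges."""
    half = size // 2
    padded = [0] * half + list(source_ids) + [0] * half
    center = affiliation + half
    return padded[center - half : center + half + 1]

def nnjm_logits(target_history, source_ids, affiliation):
    """Raw (unnormalized) output-layer scores for every candidate target word."""
    context = list(target_history) + source_window(source_ids, affiliation)
    x = E[context].reshape(-1)                  # concatenate context embeddings
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ Wo + bo

def nnjm_log_prob(target_word, target_history, source_ids, affiliation):
    """Exact log P(target_word | context) via the full softmax (slow path)."""
    z = nnjm_logits(target_history, source_ids, affiliation)
    return z[target_word] - z.max() - np.log(np.sum(np.exp(z - z.max())))
```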
Highlights
  • Recent work has shown success in using neural network language models (NNLMs) as features in MT systems
  • We demonstrate in Section 6.6 that using the self-normalized/pre-computed neural network joint model (NNJM) results in only a very small BLEU degradation compared to the standard NNJM.
  • We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions
  • Our NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese
  • We use the “Arabic-To-English Original Progress Test” (1378 segments) and “Chinese-to-English Original Progress Test + OpenMT12 Current Test” (2190 segments), which consists of a mix of newswire and web data
  • We have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model
Results
  • The authors present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions.
  • The authors' NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese.
  • The authors use the “Arabic-To-English Original Progress Test” (1378 segments) and “Chinese-to-English Original Progress Test + OpenMT12 Current Test” (2190 segments), which consists of a mix of newswire and web data.
  • Table 5 shows performance when the S2T/L2R NNJM is used only in 1000-best rescoring, compared to decoding.
Conclusion
  • The authors have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model.
  • The authors' model is remarkably simple: it requires no linguistic resources, no feature engineering, and only a handful of hyper-parameters.
  • It has no reliance on potentially fragile outside algorithms, such as unsupervised word clustering.
  • The authors consider the simplicity to be a major advantage.
  • Not only does this suggest that the model will generalize well to new language pairs and domains, it also suggests that it will be straightforward for others to replicate these results.
Objectives
  • The authors' goal is to use a fairly large vocabulary without word classes, and to avoid computing the entire output layer at decode time (see the self-normalization sketch after this list).
  • One of the biggest goals of this work is to quell any remaining doubts about the utility of neural networks in machine translation.
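The first objective above is what the paper's self-normalization addresses (compared across α values in Table 1 below): the training objective is augmented with a penalty on the log of the softmax normalizer, so that at decode time the raw output-layer score can be used directly without summing over the output vocabulary. Below is a minimal sketch of that per-example objective, reusing the illustrative `nnjm_logits` scorer from the earlier sketch; the default α and the function names are assumptions, not the authors' code.

```python
import numpy as np

def self_normalized_objective(logits, target_word, alpha=0.1):
    """Per-example self-normalized objective (to be *maximized*):
        log P(target) - alpha * (log Z)^2
    where Z is the softmax normalizer.  With alpha > 0 the network is pushed
    toward log Z ~= 0, so the raw output score approximates a log-probability
    at decode time.  alpha here is the knob compared across rows of Table 1."""
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))   # log of softmax normalizer
    log_p = logits[target_word] - log_z
    return log_p - alpha * log_z**2

def decode_time_score(logits, target_word):
    """Self-normalized fast path: skip the normalizer entirely."""
    return logits[target_word]
```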
Tables
  • Table 1: Comparison of neural network likelihood for various α values. log(P(x)) is the average log-likelihood on a held-out set. |log(Z(x))| is the mean error in log-likelihood when using Ur(x) directly instead of the true softmax probability log(P(x)). Note that α = 0 is equivalent to the standard neural network objective function.
  • Table 2: Speed of the neural network computation on a single CPU thread. “lookups/sec” is the number of unique n-gram probabilities that can be computed per second. “sec/word” is the amortized cost of unique NNJM lookups in decoding, per source word.
  • Table 3: Primary results on Arabic-English and Chinese-English NIST MT12 Test Set. The first section corresponds to the top and bottom ranked systems from the evaluation, taken from the NIST website. The second section corresponds to results on top of our strongest baseline. The third section corresponds to results on top of a simpler baseline. Within each section, each row includes all of the features from previous rows. BLEU scores are mixed-case.
  • Table 4: Primary results on Arabic-English and Chinese-English BOLT Web Forum. Each row includes the aggregate features from all previous rows.
  • Table 5: Comparison of our primary NNJM in decoding vs. 1000-best rescoring.
  • Table 6: Results with different neural network architectures. The “default” NNJM in the second row uses these parameters: SW=11, L=192x512x512, V=32,000, A=tanh. All models use a 3-word target history (i.e., a 4-gram LM). “Layers” refers to the size of the word embedding followed by the hidden layers. “Vocab” refers to the size of the input and output vocabularies. “% Gain” is the BLEU gain over the baseline relative to the default NNJM.
  • Table 7: Results for the standard NNs vs. self-normalized NNs vs. pre-computed NNs (a pre-computation sketch follows this list).
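The “pre-computed NNs” in Tables 2 and 7 speed up decoding by exploiting the fact that the first hidden layer is linear in the concatenated context embeddings, so per-position, per-word contributions can be tabulated ahead of time. The sketch below shows one way to do this, reusing the shapes from the earlier illustrative scorer (E, W1, b1, EMB, N_CONTEXT); the exact bookkeeping is an assumption, not the authors' implementation.

```python
import numpy as np

def precompute_hidden_contributions(E, W1, n_context, emb_dim):
    """table[p][w] = contribution of word w in context slot p to hidden layer 1.
    Trades memory (one (VOCAB, HID) matrix per context position) for speed."""
    table = []
    for p in range(n_context):
        W1_p = W1[p * emb_dim : (p + 1) * emb_dim, :]   # slice of W1 for slot p
        table.append(E @ W1_p)                          # (VOCAB, HID)
    return table

def fast_hidden1(table, context_ids, b1):
    """First hidden layer from precomputed contributions: a decode-time lookup
    becomes N_CONTEXT vector additions plus a tanh, instead of a full
    input-to-hidden matrix multiply.  Later layers are computed as before."""
    h = b1.copy()
    for p, w in enumerate(context_ids):
        h += table[p][w]
    return np.tanh(h)
```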
Related work
  • Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy. However, not only are these techniques intractable to train with high-order context vectors, they also lack the neural network’s ability to semantically generalize (Mikolov et al., 2013) and learn nonlinear relationships.

    A number of recent papers have proposed methods for creating neural network translation/joint models, but nearly all of these works have obtained much smaller BLEU improvements than ours. For each related paper, we will briefly contrast their methodology with our own and summarize their BLEU improvements using scores taken directly from the cited paper.

    (Footnote 14: In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates.)

    Auli et al. (2013) use a fixed continuous-space source representation, obtained from LDA (Blei et al., 2003) or a source-only NNLM. Also, their model is recurrent, so it cannot be used in decoding. They obtain a +0.2 BLEU improvement on top of a target-only NNLM (25.6 vs. 25.8).
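Since footnote 14 above notes that roughly 95% of NNJM n-gram lookups within a sentence are duplicates, a per-sentence cache recovers most of that cost. The memoization below is an illustrative sketch (not the authors' decoder), reusing the hypothetical `nnjm_log_prob` from the earlier sketch.

```python
from functools import lru_cache

def make_cached_scorer(score_fn, source_ids):
    """Build a per-sentence memoized NNJM scorer.  The source sentence is fixed
    for the lifetime of the cache, so it is captured in the closure; cache keys
    are the hashable (target word, target history tuple, affiliation index)."""
    @lru_cache(maxsize=None)
    def cached(target_word, target_history, affiliation):
        return score_fn(target_word, target_history, source_ids, affiliation)
    return cached

# Usage sketch: build one scorer per source sentence, discard it afterwards.
# scorer = make_cached_scorer(nnjm_log_prob, tuple(source_ids))
# s = scorer(tgt_id, (h1, h2, h3), affiliation_index)
```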
Funding
  • This work was supported by DARPA/I2O Contract No HR0011-12-C-0014 under the BOLT program (Approved for Public Release, Distribution Unlimited)
Reference
  • Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint language and translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1044–1054, Seattle, Washington, USA, October. Association for Computational Linguistics.
  • Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
  • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.
  • David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In HLT-NAACL, pages 218–226.
  • David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
  • Josep Maria Crego and Francois Yvon. 2010. Factored bilingual n-gram language models for statistical machine translation. Machine Translation, 24(2):159–175.
  • Jacob Devlin and Spyros Matsoukas. 2012. Trait-based hypothesis selection for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pages 528–532, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Jacob Devlin. 2009. Lexical features for statistical machine translation. Master’s thesis, University of Maryland.
  • Nizar Habash, Ryan Roth, Owen Rambow, Ramy Eskander, and Nadi Tomeh. 2013. Morphological analysis and disambiguation for dialectal Arabic. In HLT-NAACL, pages 426–432.
  • Zhongqiang Huang, Jacob Devlin, and Rabih Zbib. 2013. Factored soft source syntactic constraints for hierarchical machine translation. In EMNLP, pages 556–566.
  • Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models.
  • Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, pages 181–184. IEEE.
  • Hai-Son Le, Alexandre Allauzen, and Francois Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pages 39–48, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Muller. 1998. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer.
  • Jose B. Marino, Rafael E. Banchs, Josep M. Crego, Adria De Gispert, Patrik Lambert, Jose A. R. Fonollosa, and Marta R. Costa-Jussa. 2006. N-gram-based machine translation. Computational Linguistics, 32(4):527–549.
  • Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.
  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.
  • Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
  • Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
  • Jason Riesa, Ann Irvine, and Daniel Marcu. 2011. Feature-rich language-independent syntax-based alignment for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 497–507, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Ronald Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187–228.
  • Antti Rosti, Bing Zhang, Spyros Matsoukas, and Rich Schwartz. 2010. BBN system description for WMT10 system combination task. In WMT/MetricsMATR, pages 321–326.
  • Holger Schwenk. 2010. Continuous-space language models for statistical machine translation. Prague Bulletin of Mathematical Linguistics, 93:137–146.
  • Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In COLING (Posters), pages 1071–1080.
  • Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-dependency statistical machine translation. Computational Linguistics, 36(4):649–671, December.
  • Matthew Snover, Bonnie Dorr, and Richard Schwartz. 2008. Language and translation model adaptation using comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 857–866, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Makoto Yasuhara, Toru Tanaka, Jun-ya Yamamoto, and Mikio Norimatsu. 2013. An efficient language model using double-array structures.
  • Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, Washington, USA, October. Association for Computational Linguistics.
  • Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398.