For a standard trigram hidden Markov model, taking a Bayesian approach to POS tagging dramatically improves performance over maximum-likelihood estimation
A fully Bayesian approach to unsupervised part-of-speech tagging
Unsupervised learning of linguistic structure is a difficult problem. A common approach is to define a generative model and maximize the probability of the hidden structure given the observed data. Typically, this is done using maximum-likelihood estimation (MLE) of the model parameters. We show using part-of-speech tagging that a fully Bayesian approach ...
- Using part-of-speech (POS) tagging as an example application, the authors show that the Bayesian approach provides large performance improvements over maximum-likelihood estimation (MLE) for the same model structure.
- Standard approaches maximize the probability of the hidden structure by selecting values for the model parameters and then choosing the most probable variable assignment given those parameters.
- If a non-uniform prior distribution P(θ) is introduced, then θ̂ becomes the maximum a posteriori (MAP) solution for θ: θ̂ = argmaxθ P(w | θ) P(θ).
- In this paper, the authors hope to unify the problems of POS disambiguation and syntactic clustering by presenting results for conditions ranging from a full tag dictionary to no dictionary at all.
- For comparison with the authors' Bayesian hidden Markov model (BHMM) in this and the following sections, results are presented from Viterbi decoding of a hidden Markov model trained with maximum-likelihood estimation, running EM to convergence (MLHMM).
- A final point worth noting is that even when α = β = 1, the BHMM still performs much better than the MLHMM. This result underscores the importance of integrating over model parameters: the BHMM identifies a sequence of tags that have high probability over a range of parameter values, rather than choosing tags based on the single best set of parameters.
- The confusion matrices in Figure 3 provide a more intuitive picture of the very different sorts of clusterings produced by MLHMM and BHMM2 when no tag dictionary is available.
- The authors have demonstrated that, for a standard trigram hidden Markov model, taking a Bayesian approach to POS tagging dramatically improves performance over maximum-likelihood estimation.
- Integrating over possible parameter values leads to more robust solutions and allows the use of priors favoring sparse distributions (illustrated in the first sketch after this list).
- The Bayesian approach the authors advocate in this paper seeks to identify a distribution over latent variables directly, without ever fixing particular values for the model parameters.
- This distribution can be used in various ways, including choosing the MAP assignment to the latent variables, or estimating expected values for them.
- The authors' results show that the Bayesian approach is useful when learning is less constrained, either because less evidence is available (smaller corpora) or because the tag dictionary supplies less information.
- The authors' model has the structure of a standard trigram HMM, with the addition of symmetric Dirichlet priors over the transition and output distributions (a simplified sampler sketch follows this list):
  ti | ti−1 = t, ti−2 = t′, τ(t,t′) ∼ Mult(τ(t,t′))
  wi | ti = t, ω(t) ∼ Mult(ω(t))
  τ(t,t′) | α ∼ Dirichlet(α)
  ω(t) | β ∼ Dirichlet(β)
- Using a single sample makes standard evaluation methods possible, but yields suboptimal results because the value for each tag is sampled from a distribution, and some tags will be assigned low-probability values (see the marginal-decoding sketch below).
- The improved results of the BHMM demonstrate that selecting a sequence that is robust to variations in the parameters leads to better performance.
- In this set of experiments, the authors used the full tag dictionary, but performed inference on the hyperparameters (a generic hyperparameter-sampling sketch appears below).
- The authors hope that the success with POS tagging will inspire further research into Bayesian methods for other natural language learning tasks.
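The sparsity point can be made concrete in a few lines. The sketch below is illustrative only (not from the paper): it draws multinomials from symmetric Dirichlet priors with decreasing concentration and shows that small hyperparameter values concentrate probability mass on a few outcomes, as is typical of natural-language transition and output distributions. The concentration values are arbitrary choices.

```python
# Illustration only: symmetric Dirichlet draws with concentration < 1
# put most of their mass on a few outcomes (sparse multinomials).
import numpy as np

rng = np.random.default_rng(0)
K = 20  # hypothetical number of outcomes (e.g., tags or words)

for conc in (1.0, 0.1, 0.01):
    draws = rng.dirichlet(np.full(K, conc), size=1000)
    # Average fraction of mass on each draw's single largest outcome.
    top_mass = draws.max(axis=1).mean()
    print(f"alpha = {conc:>4}: average largest-component mass = {top_mass:.2f}")
```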
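Because the Dirichlet priors above can be integrated out analytically, each tag can be resampled from a closed-form posterior predictive built from the counts at all other positions. The sketch below shows this collapsed Gibbs step in simplified form: it scores only the incoming trigram and the emission at position i, whereas the paper's sampler also scores the two later trigrams containing ti and corrects for their shared counts. All function names, the toy data, and the hyperparameter values are hypothetical.

```python
# Simplified collapsed Gibbs step for a Bayesian trigram HMM (sketch only).
import random
from collections import defaultdict

def predictive(count: int, total: int, k: int, conc: float) -> float:
    """Dirichlet-multinomial posterior predictive: (n + conc) / (N + k*conc)."""
    return (count + conc) / (total + k * conc)

def sample_tag(i, tags, words, trigram, bigram, emit, tag_total,
               tagset, vocab_size, alpha, beta, rng=random):
    """Resample tags[i] from its simplified conditional distribution,
    assuming position i's own counts have already been removed."""
    ctx = (tags[i - 2], tags[i - 1])
    weights = []
    for t in tagset:
        p_trans = predictive(trigram[ctx + (t,)], bigram[ctx], len(tagset), alpha)
        p_emit = predictive(emit[(t, words[i])], tag_total[t], vocab_size, beta)
        weights.append(p_trans * p_emit)
    return rng.choices(tagset, weights=weights, k=1)[0]

# Toy usage on hypothetical data (alpha and beta values are arbitrary).
words = ["<s>", "<s>", "the", "dog", "runs"]
tags = ["S", "S", "D", "N", "V"]
trigram, bigram, emit, tag_total = (defaultdict(int) for _ in range(4))
for j in range(2, len(tags)):
    trigram[(tags[j - 2], tags[j - 1], tags[j])] += 1
    bigram[(tags[j - 2], tags[j - 1])] += 1
    emit[(tags[j], words[j])] += 1
    tag_total[tags[j]] += 1

i = 3  # resample the tag of "dog": first remove its counts...
trigram[(tags[i - 2], tags[i - 1], tags[i])] -= 1
bigram[(tags[i - 2], tags[i - 1])] -= 1
emit[(tags[i], words[i])] -= 1
tag_total[tags[i]] -= 1
# ...then draw a new tag from the posterior predictive.
tags[i] = sample_tag(i, tags, words, trigram, bigram, emit, tag_total,
                     tagset=["D", "N", "V"], vocab_size=3, alpha=0.1, beta=1.0)
print("resampled tag for 'dog':", tags[i])
```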
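One common remedy for the single-sample problem, in the spirit of the "estimating expected values" option mentioned above, is to aggregate many Gibbs samples and take each token's most frequent tag, an estimate of the per-token marginal MAP. A minimal sketch, assuming a hypothetical gibbs_sample callable that returns one complete tag sequence per call:

```python
# Sketch: estimate each token's marginal MAP tag by aggregating samples,
# rather than evaluating a single (possibly low-probability) sample.
import random
from collections import Counter

def marginal_decode(gibbs_sample, n_samples: int = 100):
    """Return, for each position, the tag occurring most often across
    n_samples posterior samples of the full tag sequence."""
    samples = [gibbs_sample() for _ in range(n_samples)]
    return [Counter(s[i] for s in samples).most_common(1)[0][0]
            for i in range(len(samples[0]))]

# Demo with a stand-in sampler: position 0 is uncertain, position 1 is not.
fake_sampler = lambda: [random.choice(["N", "N", "V"]), "D"]
print(marginal_decode(fake_sampler))  # most likely ['N', 'D']
```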
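These bullets do not spell out how the hyperparameters are inferred, so the following is a generic sketch rather than the authors' exact scheme: a log-normal random-walk Metropolis-Hastings update of a symmetric Dirichlet concentration, scored by the Dirichlet-multinomial marginal likelihood and assuming a flat prior on the hyperparameter.

```python
# Generic hyperparameter-sampling sketch (not necessarily the paper's scheme).
import math
import random

def dm_log_marginal(context_counts, k: int, conc: float) -> float:
    """Log marginal likelihood of grouped counts, one count vector per
    conditioning context, under symmetric Dirichlet(conc) priors."""
    ll = 0.0
    for counts in context_counts:
        n = sum(counts)
        ll += math.lgamma(k * conc) - math.lgamma(k * conc + n)
        ll += sum(math.lgamma(c + conc) - math.lgamma(conc) for c in counts)
    return ll

def mh_update(conc, context_counts, k, step=0.1, rng=random):
    """One Metropolis-Hastings step; the log term corrects for the
    asymmetry of the multiplicative (log-normal) proposal."""
    proposal = conc * math.exp(rng.gauss(0.0, step))
    log_accept = (dm_log_marginal(context_counts, k, proposal)
                  - dm_log_marginal(context_counts, k, conc)
                  + math.log(proposal / conc))
    if math.log(rng.random()) < log_accept:
        return proposal
    return conc

# Hypothetical sparse counts pull the sampled concentration toward small values.
counts = [[5, 0, 0], [0, 7, 1]]
conc = 1.0
for _ in range(500):
    conc = mh_update(conc, counts, k=3)
print(f"sampled concentration: {conc:.3f}")
```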
- Table 1: Percentage of words tagged correctly by the BHMM as a function of the hyperparameters α and β. Results are averaged over 5 runs on the 24k corpus with the full tag dictionary. Standard deviations in most cases are less than 0.5.
- Table 2: Percentage of words tagged correctly by the various models on different-sized corpora. BHMM1 and BHMM2 use hyperparameter inference; CRF/CE uses parameter selection based on an unlabeled development set. Standard deviations (σ) for the BHMM results fell below those shown for each corpus size.
- Table 3: Percentage of words tagged correctly, and variation of information between the clusterings induced by the assigned and gold-standard tags, as the amount of information in the dictionary is varied. Standard deviations (σ) for the BHMM results fell below those shown in each column. The percentage of ambiguous tokens and the average number of tags per token for each value of d are also shown.
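Variation of information, the clustering metric reported in Table 3, is computable directly from the two taggings via VI(A, B) = H(A) + H(B) − 2 I(A; B). A short self-contained sketch (the toy example is hypothetical):

```python
# Variation of information between two taggings of the same tokens, in nats;
# 0 means the clusterings are identical up to relabeling (lower is better).
import math
from collections import Counter

def variation_of_information(tags_a, tags_b) -> float:
    n = len(tags_a)
    assert n == len(tags_b), "taggings must cover the same tokens"
    pa, pb = Counter(tags_a), Counter(tags_b)
    joint = Counter(zip(tags_a, tags_b))
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    mi = sum(c / n * math.log(c * n / (pa[a] * pb[b]))
             for (a, b), c in joint.items())
    return h_a + h_b - 2.0 * mi

# A pure relabeling scores (numerically) zero:
print(variation_of_information(["N", "V", "N"], ["x", "y", "x"]))  # ~0.0
```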