A fully Bayesian approach to unsupervised part-of-speech tagging

ACL (2007)

Cited by 339

Abstract

Unsupervised learning of linguistic structure is a difficult problem. A common approach is to define a generative model and maximize the probability of the hidden structure given the observed data. Typically, this is done using maximum-likelihood estimation (MLE) of the model parameters. We show using part-of-speech tagging that a f…

Introduction
  • Using part-of-speech (POS) tagging as an example application, the authors show that the Bayesian approach provides large performance improvements over maximum-likelihood estimation (MLE) for the same model structure.
  • Standard approaches select values for the model parameters and then choose the most probable assignment of the hidden variables based on those parameters.
  • A non-uniform prior distribution over θ may also be introduced, in which case θ̂ is the maximum a posteriori (MAP) solution for θ. In this paper, the authors hope to unify the problems of POS disambiguation and syntactic clustering by presenting results for conditions ranging from a full tag dictionary to no dictionary at all.
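For concreteness, the two estimation strategies contrasted above can be written out; this is a generic notational sketch (with $\mathbf{w}$ for the observed words, $\mathbf{t}$ for the hidden tags, and $\theta$ for the model parameters), not the paper's own equations:

$$\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta} P(\mathbf{w} \mid \theta), \qquad \hat\theta_{\mathrm{MAP}} = \arg\max_{\theta} P(\mathbf{w} \mid \theta)\,P(\theta),$$

after which the standard approach tags the corpus with $\arg\max_{\mathbf{t}} P(\mathbf{t} \mid \mathbf{w}, \hat\theta)$. The fully Bayesian alternative described in the following sections avoids committing to any single point estimate $\hat\theta$.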
Highlights
  • For comparison with our Bayesian hidden Markov model (BHMM) in this and following sections, we present results from the Viterbi decoding of a hidden Markov model trained with maximum-likelihood estimation by running EM to convergence (MLHMM)
  • A final point worth noting is that even when α = β = 1 the Bayesian HMM still performs much better than the MLHMM. This result underscores the importance of integrating over model parameters: the Bayesian HMM identifies a sequence of tags that have high probability over a range of parameter values, rather than choosing tags based on the single best set of parameters
  • The confusion matrices in Figure 3 provide a more intuitive picture of the very different sorts of clusterings produced by MLHMM and BHMM2 when no tag dictionary is available
  • We have demonstrated that, for a standard trigram hidden Markov model, taking a Bayesian approach to POS tagging dramatically improves performance over maximum-likelihood estimation
  • Integrating over possible parameter values leads to more robust solutions and allows the use of priors favoring sparse distributions
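In the same generic notation (again a sketch rather than the paper's exact equations), integrating over parameter values means working with the marginal posterior over tag sequences,

$$P(\mathbf{t} \mid \mathbf{w}, \alpha, \beta) \;\propto\; \iint P(\mathbf{w} \mid \mathbf{t}, \omega)\, P(\mathbf{t} \mid \tau)\, P(\tau \mid \alpha)\, P(\omega \mid \beta)\, d\tau\, d\omega,$$

which is tractable because the symmetric Dirichlet priors are conjugate to the multinomial transition and output distributions. For instance, with a symmetric Dirichlet($\beta$) prior over a $W$-word output distribution, the collapsed probability of word $w$ under tag $t$ given the rest of the data is $(n_{t,w} + \beta)/(n_t + W\beta)$, where $n_{t,w}$ and $n_t$ are counts over the remaining tokens; small $\beta$ therefore favors sparse output distributions, which is what "priors favoring sparse distributions" refers to.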
Results
  • The Bayesian approach the authors advocate in this paper seeks to identify a distribution over latent variables directly, without ever fixing particular values for the model parameters.
  • This distribution can be used in various ways, including choosing the MAP assignment to the latent variables or estimating expected values for them. The authors' results show that the Bayesian approach is useful when learning is less constrained, either because less evidence is available or because the tag dictionary provides less information.
  • The authors' model has the structure of a standard trigram HMM, with the addition of symmetric Dirichlet priors over the transition and output distributions: $t_i \mid t_{i-1}=t,\ t_{i-2}=t',\ \tau^{(t,t')} \sim \mathrm{Mult}(\tau^{(t,t')})$, where $\tau^{(t,t')}$ is the transition distribution for the tag context $(t, t')$ (a sampling sketch is given after this list).
  • Using a single sample makes standard evaluation methods possible, but yields suboptimal results because the value for each tag is sampled from a distribution, and some tags will be assigned low-probability values.
  • This result underscores the importance of integrating over model parameters: the BHMM identifies a sequence of tags that have high probability over a range of parameter values, rather than choosing tags based on the single best set of parameters.
  • The improved results of the BHMM demonstrate that selecting a sequence that is robust to variations in the parameters leads to better performance.
  • In this set of experiments, the authors used the full tag dictionary, but performed inference on the hyperparameters.
  • The confusion matrices in Figure 3 provide a more intuitive picture of the very different sorts of clusterings produced by MLHMM and BHMM2 when no tag dictionary is available.
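To make the sampling-based inference referred to above concrete, here is a minimal, runnable sketch of collapsed Gibbs sampling for a Bayesian HMM with symmetric Dirichlet priors. It is an illustration under simplifying assumptions, not the authors' implementation: a bigram rather than trigram model, fixed hyperparameters α and β (the values below are arbitrary placeholders, whereas BHMM1/BHMM2 infer them), no tag dictionary constraint, omission of the small count corrections needed when neighbouring positions share a tag, and a toy word list in place of the paper's corpora.

```python
# Minimal collapsed Gibbs sampler for a Bayesian (bigram) HMM with symmetric
# Dirichlet priors -- an illustrative sketch, not the paper's trigram BHMM.
import random
from collections import defaultdict

def gibbs_tag(words, num_tags, alpha=0.1, beta=0.1, iters=200, seed=0):
    """Return one sampled tag sequence for `words` after `iters` Gibbs sweeps.

    alpha/beta are the symmetric Dirichlet hyperparameters for the transition
    and output distributions; the values here are arbitrary placeholders.
    """
    rng = random.Random(seed)
    vocab_size = len(set(words))
    n = len(words)

    # Random initial tag assignment for every token.
    tags = [rng.randrange(num_tags) for _ in range(n)]

    trans = defaultdict(int)      # (prev_tag, tag) -> transition count
    emit = defaultdict(int)       # (tag, word)     -> emission count
    tag_count = defaultdict(int)  # tag -> number of tokens carrying that tag
    ctx_count = defaultdict(int)  # prev_tag -> number of transitions out of it
    for i, w in enumerate(words):
        prev = tags[i - 1] if i > 0 else -1   # -1 marks the sequence start
        trans[(prev, tags[i])] += 1
        ctx_count[prev] += 1
        emit[(tags[i], w)] += 1
        tag_count[tags[i]] += 1

    for _ in range(iters):
        for i, w in enumerate(words):
            prev = tags[i - 1] if i > 0 else -1
            nxt = tags[i + 1] if i + 1 < n else None
            old = tags[i]

            # Remove the current assignment of position i from all counts.
            trans[(prev, old)] -= 1
            ctx_count[prev] -= 1
            emit[(old, w)] -= 1
            tag_count[old] -= 1
            if nxt is not None:
                trans[(old, nxt)] -= 1
                ctx_count[old] -= 1

            # Collapsed conditional P(t_i = t | all other tags, words): the
            # Dirichlet-multinomial conjugacy leaves counts plus pseudo-counts.
            # (The exact conditional has small corrections when prev == t or
            # t == nxt; they are omitted here for brevity.)
            weights = []
            for t in range(num_tags):
                p = (trans[(prev, t)] + alpha) / (ctx_count[prev] + num_tags * alpha)
                p *= (emit[(t, w)] + beta) / (tag_count[t] + vocab_size * beta)
                if nxt is not None:
                    p *= (trans[(t, nxt)] + alpha) / (ctx_count[t] + num_tags * alpha)
                weights.append(p)

            # Resample the tag for position i and restore the counts.
            new = rng.choices(range(num_tags), weights=weights)[0]
            tags[i] = new
            trans[(prev, new)] += 1
            ctx_count[prev] += 1
            emit[(new, w)] += 1
            tag_count[new] += 1
            if nxt is not None:
                trans[(new, nxt)] += 1
                ctx_count[new] += 1
    return tags

if __name__ == "__main__":
    toy = "the dog saw the cat and the cat saw the dog".split()
    print(gibbs_tag(toy, num_tags=3))
```

With a tag dictionary, as in most of the paper's experimental conditions, the loop over candidate tags would simply be restricted to the tags the dictionary licenses for each word; averaging over multiple runs, as in the tables below, reduces the variance noted in the bullet about single samples.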
Conclusion
  • The authors have demonstrated that, for a standard trigram HMM, taking a Bayesian approach to POS tagging dramatically improves performance over maximum-likelihood estimation.
  • Integrating over possible parameter values leads to more robust solutions and allows the use of priors favoring sparse distributions.
  • The authors hope that the success with POS tagging will inspire further research into Bayesian methods for other natural language learning tasks
Tables
  • Table1: Percentage of words tagged correctly by BHMM as a function of the hyperparameters α and β. Results are averaged over 5 runs on the 24k corpus with full tag dictionary. Standard deviations in most cases are less than .5
  • Table2: Percentage of words tagged correctly by the various models on different sized corpora. BHMM1 and BHMM2 use hyperparameter inference; CRF/CE uses parameter selection based on an unlabeled development set. Standard deviations (σ) for the BHMM results fell below those shown for each corpus size
  • Table3: Percentage of words tagged correctly and variation of information between clusterings induced by the assigned and gold standard tags as the amount of information in the dictionary is varied. Standard deviations (σ) for the BHMM results fell below those shown in each column. The percentage of ambiguous tokens and average number of tags per token for each value of d is also shown
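Table 3's second metric, variation of information (VI), is the clustering distance of Meila (2002) listed in the references. As a reminder of the standard definition (not quoted from the paper), for two clusterings $C$ and $C'$ of the same tokens,

$$\mathrm{VI}(C, C') = H(C) + H(C') - 2\,I(C, C'),$$

where $H$ is entropy and $I$ is mutual information; lower values mean the induced clustering is closer to the gold-standard tagging.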
Funding
  • This work was supported by grants NSF 0631518 and ONR MURI N000140510388
References
  • M. Banko and R. Moore. 2004. A study of unsupervised partof-speech tagging. In Proceedings of COLING ’04.
  • E. Brill. 1995. Unsupervised learning of disambiguation rules for part of speech tagging. In Proceedings of the 3rd Workshop on Very Large Corpora, pages 1–13.
  • P. Brown, V. Della Pietra, V. de Souza, J. Lai, and R. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.
  • A. Clark. 2000. Inducing syntactic categories by context distribution clustering. In Proceedings of the Conference on Natural Language Learning (CONLL).
  • S. Finch, N. Chater, and M. Redington. 1995. Acquiring syntactic information from distributional statistics. In J. Levy, D. Bairaktaris, J. Bullinaria, and P. Cairns, editors, Connectionist Models of Memory and Language. UCL Press, London.
  • S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.
  • W.R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. 1996. Markov Chain Monte Carlo in Practice. Chapman and Hall, Suffolk.
  • A. Haghighi and D. Klein. 2006. Prototype-driven learning for sequence models. In Proceedings of HLT-NAACL.
  • M. Johnson, T. Griffiths, and S. Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo.
  • D. Klein and C. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the ACL.
  • D. MacKay and L. Bauman Peto. 1995. A hierarchical Dirichlet language model. Natural Language Engineering, 1:289– 307.
  • M. Meila. 2002. Comparing clusterings. Technical Report 418, University of Washington Statistics Department.
  • B. Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–172.
  • L. Saul and F. Pereira. 1997. Aggregate and mixed-order markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • H. Schutze. 1995. Distributional part-of-speech tagging. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).
  • N. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL.
  • I. Wang and D. Schuurmans. 2005. Improved estimation for unsupervised part-of-speech tagging. In Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE).
Authors
Tom Griffiths