EMNLP@CPH: Is frequency all there is to simplicity?
Joint Conference on Lexical and Computational Semantics (2012)
Abstract
Our system breaks down the problem of ranking a list of lexical substitutions according to how simple they are in a given context into a series of pairwise comparisons between candidates, for which we learn a binary classifier. Because only very little training data is provided, we describe a procedure for generating artificial unlabeled data from WordNet and a corpus, and approach the classification task as a semi-supervised machine learning problem. We use a co-training procedure in which each classifier augments the other classifier's training set with selected instances from an unlabeled data set. Our features include n-gram probabilities of candidate and context in a web corpus, distributional differences of the candidate between a corpus of easy sentences and a corpus of normal sentences, syntactic complexity of documents similar to the given context, candidate length, and letter-wise recognizability of the candidate as measured by a trigram character language model.
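The reduction of ranking to pairwise classification described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the length-based comparator are hypothetical stand-ins for the paper's trained binary classifier, and the ranking simply counts pairwise wins.

```python
def rank_by_pairwise_wins(candidates, simpler):
    """Rank candidates from simplest to hardest by counting pairwise wins.

    `simpler(a, b)` is a binary classifier returning True if candidate
    `a` is judged simpler than candidate `b` in the given context.
    """
    wins = {c: 0 for c in candidates}
    for a in candidates:
        for b in candidates:
            if a != b and simpler(a, b):
                wins[a] += 1
    # Candidates with more pairwise wins are ranked as simpler.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Stand-in comparator: word length as a crude proxy for simplicity.
# The paper's classifier instead uses n-gram probabilities, distributional,
# and syntactic features of candidate and context.
print(rank_by_pairwise_wins(["intelligent", "smart", "clever"],
                            lambda a, b: len(a) < len(b)))
# → ['smart', 'clever', 'intelligent']
```

In the semi-supervised setting of the paper, the comparator itself would be retrained in co-training rounds, with each of the two classifiers labeling selected unlabeled candidate pairs for the other.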
Keywords
candidate length, web corpus, artificial unlabeled data, binary classifier, classifier increase, training data, unlabeled data, co-training procedure, classification task, distributional difference