Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages.
LREC 2016 - Tenth International Conference on Language Resources and Evaluation (2016)
Abstract
Structured, complete inflectional paradigm data exists for very few of the world's languages, but is crucial to training morphological analysis tools. We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely located speakers of any language. Informants are tasked with completing language-specific paradigm elicitation templates. Templates are constructed by linguists using grammatical reference materials to ensure completeness. Each cell in a template is associated with contextual prompts designed to help informants with varying levels of linguistic expertise (from professional translators to untrained native speakers) provide the desired inflected form. To facilitate downstream use in interoperable NLP/HLT applications, each cell is also associated with a language-independent machine-readable set of morphological tags from the UniMorph Schema. This data is useful for seeding morphological analysis and generation software, particularly when the data is representative of the range of surface morphological variation in the language. At present, we have obtained 792 lemmas and 25,056 inflected forms from 15 languages.
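
The template structure described in the abstract (cells that pair a contextual prompt with a UniMorph tag set) might be represented roughly as follows. This is a minimal sketch and not the authors' implementation: the class names `TemplateCell` and `ParadigmTemplate`, the English prompts, the `{lemma}` placeholder, and the tab-separated lemma/form/tags output are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TemplateCell:
    """One paradigm cell (hypothetical structure, not from the paper)."""
    tags: str    # UniMorph-style tag bundle, e.g. "V;PRS;3;SG"
    prompt: str  # contextual prompt; "{lemma}" is an assumed placeholder


@dataclass
class ParadigmTemplate:
    """A language-specific elicitation template built by a linguist."""
    language: str
    cells: List[TemplateCell]

    def prompts_for(self, lemma: str) -> List[Tuple[str, str]]:
        # Pair each cell's tags with the prompt shown to the informant.
        return [(c.tags, c.prompt.format(lemma=lemma)) for c in self.cells]

    def collect(self, lemma: str, answers: Dict[str, str]) -> List[Tuple[str, str, str]]:
        # Turn informant answers (keyed by tag bundle) into
        # (lemma, inflected form, tags) triples, skipping empty cells.
        rows = []
        for cell in self.cells:
            form = answers.get(cell.tags, "").strip()
            if form:
                rows.append((lemma, form, cell.tags))
        return rows

    def missing(self, answers: Dict[str, str]) -> List[str]:
        # Tag bundles the informant has not filled in yet.
        return [c.tags for c in self.cells if not answers.get(c.tags, "").strip()]


def to_tsv(rows: List[Tuple[str, str, str]]) -> str:
    # One triple per line, tab-separated (an assumed export format).
    return "\n".join("\t".join(row) for row in rows)


# Illustrative English template with two cells (made-up example data).
template = ParadigmTemplate(
    language="eng",
    cells=[
        TemplateCell("V;PRS;3;SG", "She ___ every day. ({lemma})"),
        TemplateCell("V;PST", "Yesterday she ___. ({lemma})"),
    ],
)
answers = {"V;PRS;3;SG": "walks", "V;PST": "walked"}
rows = template.collect("walk", answers)
print(to_tsv(rows))
```

Keying answers by tag bundle rather than by cell position keeps the collected forms tied to language-independent labels, which is the property the abstract highlights for downstream interoperability.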
Keywords
Low-Resource Languages, Morphology, UniMorph Schema, Seed Corpus, Crowdsourcing, Linguistic Fieldwork