Can Humans Identify Domains?
arXiv (2024)
Abstract
Textual domain is a crucial property within the Natural Language Processing
(NLP) community due to its effects on downstream model performance. The concept
itself is, however, loosely defined and, in practice, refers to any
non-typological property, such as genre, topic, medium or style of a document.
We investigate the core notion of domains via human proficiency in identifying
related intrinsic textual properties, specifically the concepts of genre
(communicative purpose) and topic (subject matter). We publish our annotations
in *TGeGUM*: a collection of 9.1k sentences from the GUM dataset (Zeldes, 2017)
with single-sentence and larger-context (i.e., prose) annotations for one of 11
genres (source type) and for topic/subtopic as per the Dewey Decimal library
classification system (Dewey, 1979), which comprises 10/100 hierarchical topics
of increasing granularity. Each instance is annotated by three annotators, for a
total of 32.7k annotations, allowing us to examine the level of human
disagreement and the relative difficulty of each annotation task. With a
Fleiss' kappa of at most 0.53 on the sentence level and 0.66 at the prose
level, it is evident that despite the ubiquity of domains in NLP, there is
little human consensus on how to define them. By training classifiers to
perform the same task, we find that this uncertainty also extends to NLP
models.
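
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Fleiss' kappa can be computed for items that each receive the same number of categorical labels (three per instance in this setting). It is an illustration of the standard formula, not the authors' evaluation code; the function name `fleiss_kappa` and the toy genre labels are assumptions made for this example.

```python
import math
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of labels.

    Every item must carry the same number of labels (one per rater).
    `categories` is the full label inventory (hypothetical here).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for item in ratings:
        if len(item) != n_raters:
            raise ValueError("all items need the same number of ratings")
        tally = Counter(item)
        counts.append([tally.get(cat, 0) for cat in categories])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if math.isclose(p_expected, 1.0):
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: two sentences, three annotators each, two made-up genre labels.
toy_labels = ["news", "fiction"]
full_agreement = [["news"] * 3, ["fiction"] * 3]
split_votes = [["news", "news", "fiction"], ["news", "fiction", "fiction"]]
```

On the toy data, `fleiss_kappa(full_agreement, toy_labels)` evaluates to 1.0, while `split_votes` yields a negative value, meaning agreement below what chance would predict. Values such as 0.53 and 0.66 fall between chance-level agreement (0) and perfect agreement (1).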