Topic Modeling for the Social Sciences


As textual datasets grow in size and scope, social scientists need better tools to help make sense of that data. Despite the natural applicability of topic modeling to many such problems, word counts and tag clouds are often used as the primary means of gleaning information from textual data. We characterize two barriers to adoption encountered during a collaboration between the Stanford NLP group and social scientists in the school of education: accessibility and trust. Accessibility refers to the technical barriers that make text processing and topic modeling difficult. Trust comes when practitioners can explore and validate a model being used to discover or support a hypothesis. We introduce recent work aimed at addressing these challenges, including the Stanford Topic Modeling Toolbox software.

Topic models hold great promise as a means of gleaning actionable insight from the text datasets now available to social scientists, business analysts, and others. The underlying goal of such investigators is a better understanding of some phenomenon in the world through the text people have written. In the Mimir project at Stanford, computer scientists in the natural language processing group have worked closely with social scientists in the school of education. During this interaction, we discovered two main barriers to the adoption of topic models in the social sciences.

The first is accessibility of the models: text processing is messy, and most existing tools assume reasonable familiarity with scripting, command-line software invocation, and data pre-processing. While many social scientists are technically capable, fewer are proficient in all of these prerequisites. In Section 2, we describe this issue in more detail, and introduce the Stanford Topic Modeling Toolbox as a step toward more accessible topic modeling for the social sciences.

The more central issue, perhaps, is trust.
Ultimately, the intended use of topic models is to tell a compelling story about textual data in order to support or inspire hypotheses. For example, a social scientist might wish to understand relationships between teens and teachers in online social networks. Armed with a corpus of text from a social networking site, such an investigator may seek to uncover distinctions in teens' posts when they are or are not viewable by teachers. Topics can act as a natural means to characterize these differences. But how can an investigator trust a system describing text that, by nature of the problem size, he or she has never read? This is a fundamental concern in topic modeling for text, which we consider in Section 3, arguing for both improved models to overcome existing shortcomings and better support for interactive exploration.