A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data From the Internet for Use in Psychological Research.

Richard N Landers, Robert C Brusso, Katelyn J Cavanaugh, Andrew B Collmus

Psychological Methods (2016)

Abstract

The term big data encompasses a wide range of approaches to collecting and analyzing data in ways that were not possible before the era of modern personal computing. One approach to big data of great potential to psychologists is web scraping, which involves the automated collection of information from webpages. Although web scraping can create massive datasets with tens of thousands of variables, it can also be used to create, in a matter of hours, modestly sized and more manageable datasets with tens of variables but hundreds of thousands of cases, which most psychologists already have the skills to analyze. In this article, we demystify web scraping methods as currently used to examine research questions of interest to psychologists. First, we introduce an approach called theory-driven web scraping, in which the choice to use web-based big data must follow substantive theory. Second, we introduce data source theories, a term used to describe the assumptions a researcher must make about a prospective big data source in order to meaningfully scrape data from it. Critically, researchers must derive specific hypotheses to be tested based upon their data source theory, and if these hypotheses are not empirically supported, plans to use that data source should be changed or abandoned. Third, we provide a case study and sample code in Python demonstrating how web scraping can be conducted to collect big data, along with links to a web tutorial designed for psychologists. Fourth, we describe a 4-step process to be followed in web scraping projects. Fifth and finally, we discuss legal, practical, and ethical concerns faced when conducting web scraping projects.
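
As a rough illustration of what "automated collection of information from webpages" can look like, the sketch below turns a small piece of HTML into a dataset with one case per post and one variable per field. It is not the authors' sample code. The HTML snippet, the class names `post`, `author`, and `body`, and the names `PostParser` and `scrape_posts` are all hypothetical, and the page content is an inline string rather than a live download, so only the Python standard library is needed.

```python
from html.parser import HTMLParser

# Hypothetical page content. A real project would download this from a site
# whose terms of service permit automated collection.
SAMPLE_HTML = """
<div class="post"><span class="author">user_a</span><p class="body">First message</p></div>
<div class="post"><span class="author">user_b</span><p class="body">Second message</p></div>
"""


class PostParser(HTMLParser):
    """Collect one dict per post, with one key per variable of interest."""

    def __init__(self):
        super().__init__()
        self.cases = []      # one entry per case (post)
        self._field = None   # variable currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            # A new post starts a new case (row) in the dataset.
            self.cases.append({})
        elif css_class in ("author", "body") and self.cases:
            # The text that follows belongs to this variable.
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.cases[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        # Stop recording text once the current element closes.
        self._field = None


def scrape_posts(html):
    """Return a list of cases extracted from an HTML string."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.cases


cases = scrape_posts(SAMPLE_HTML)
for case in cases:
    print(case["author"], "|", case["body"])
```

In the article's terms, a data source theory would be needed before treating these rows as measurements: for example, an assumption about who posts on the site and why would have to be stated and tested before the scraped text is analyzed.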

Keywords

big data, web scraping, data source theory, Python, tutorial
