Smart Algorithmic Based Web Crawling And Scraping With Template Autoupdate Capabilities

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE(2021)

Abstract
Web scraping is the process of extracting data from web pages, and it is an essential part of dataset generation. The field is currently dominated by capable commercial applications; however, there is always a need for web crawling and web scraping applications for custom projects. Developing fit-for-purpose tools for retrieving and structuring data from web services, cloud systems, and big data is a challenging task. Based on empirical studies, the challenges include structural issues, formatting/presentation, availability, denial of service, size, and information-fetching problems with browsers. Additionally, the data become inaccessible when the structure/template of the website changes, for example after a website update. The dataset then cannot be updated in the future without manually modifying the parameters of the web scraper. In this paper we propose an algorithm capable of autocorrecting the template (the web scraping parameters) used for locating the target data, and of dealing with some common empirical problems. This is particularly useful when the dataset needs to be updated later, since websites tend to change their pages. Moreover, we introduce an implementation of the algorithm in a tool developed for extracting data from the Unity Asset Store. The tool can capture and store data in XML format. The tool extracted a total of 46 785 items (40 611 3D and 6174 2D), with 35 successful first retries, 11 second retries, and 5 failures.
Keywords
unity asset store dataset, web crawling, web data extraction, web harvesting, web scraping
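
The abstract does not spell out the autocorrection algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical template of the form (tag, class) and a caller-supplied validator, here a price pattern. When the stored template no longer locates a valid value, the sketch scans the page for any element whose text passes the validator and adopts that element's tag and class as the updated template. All names (`scrape`, `collect`, `is_price`, the sample pages) are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped entirely.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _ElementCollector(HTMLParser):
    """Collects (tag, class, text) for each element that directly contains text."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements: [tag, class, text parts]
        self.elements = []  # closed elements, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append([tag, dict(attrs).get("class") or "", []])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        # Ignore void tags and stray closing tags with no matching open tag.
        if tag in VOID_TAGS or tag not in [entry[0] for entry in self.stack]:
            return
        while self.stack:
            open_tag, cls, parts = self.stack.pop()
            text = "".join(parts).strip()
            if text:
                self.elements.append((open_tag, cls, text))
            if open_tag == tag:
                break


def collect(html):
    parser = _ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def scrape(html, template, looks_valid):
    """Return (value, template); the template is replaced if it had to be corrected."""
    elements = collect(html)
    # First attempt: use the stored template as-is.
    for tag, cls, text in elements:
        if (tag, cls) == template and looks_valid(text):
            return text, template
    # Template is stale: adopt the first element whose text validates.
    for tag, cls, text in elements:
        if looks_valid(text):
            return text, (tag, cls)
    return None, template  # nothing on the page validates; report a failure


def is_price(text):
    return re.fullmatch(r"\$\d+(\.\d{2})?", text) is not None


old_page = '<div><span class="price">$12.50</span></div>'
new_page = '<div><p class="asset-cost">$12.50</p><br></div>'

value, template = scrape(old_page, ("span", "price"), is_price)
value_after_update, corrected = scrape(new_page, template, is_price)
```

In this sketch the corrected template would be stored and reused for later pages, so a single site redesign is handled once rather than on every request. A validator that is too permissive can adopt the wrong element, which is one reason a real tool would likely need stricter checks and a bounded number of retries.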