Got 404s? Crawling and Analyzing an Institution's Web Domain

Linking Theory and Practice of Digital Libraries (TPDL 2022)

Abstract
Link rot, the disappearance of web resources, is detrimental to an institution's web presence, which is commonly used to communicate, for example, research highlights and organizational news. Organizations, especially taxpayer-funded ones such as the Los Alamos National Laboratory (LANL), therefore put emphasis on the availability and authenticity of their institutional record on the web. We conducted a web crawl of the lanl.gov domain and investigated the scale of missing resources and the ratio of resources recovered from public web archives. We found a noticeable number of special cases of link rot (soft 404s) and transient errors, and had little success in recovering resources from web archives. We argue that, as an institution, we could become a better steward of our web content by establishing an institutional web archive to improve the availability and authenticity of web resources.
Keywords
Institutional web archiving, Link rot, Domain crawl
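
The abstract separates missing resources into ordinary link rot, soft 404s, and transient errors. As a rough illustration of how a crawl result might be sorted into those buckets, here is a minimal Python sketch. It is not the authors' method: the `Probe` record, the `classify` function, the category labels, and the phrase list in `SOFT_404_HINTS` are all hypothetical, and treating 5xx, 429, and failed requests as "transient" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases suggesting an error page that was served with HTTP 200.
# A real crawl would need a more robust soft-404 test than keyword matching.
SOFT_404_HINTS = ("page not found", "no longer available", "does not exist")


@dataclass
class Probe:
    """One observation of a URL; status is None when the request itself failed."""
    status: Optional[int]
    body: str = ""


def classify(probe: Probe) -> str:
    """Sort a single probe into an illustrative link-health category."""
    # Assumption: timeouts, rate limiting, and server errors may succeed on retry.
    if probe.status is None or probe.status == 429 or 500 <= probe.status < 600:
        return "transient"
    # The server states explicitly that the resource is missing or gone.
    if probe.status in (404, 410):
        return "hard404"
    if probe.status == 200:
        text = probe.body.lower()
        # A success status whose body reads like an error page is a soft 404.
        if any(hint in text for hint in SOFT_404_HINTS):
            return "soft404"
        return "ok"
    # Redirects and other statuses need follow-up handling not shown here.
    return "other"
```

Because a transient error can clear on its own, a crawler using this kind of classification would typically probe such URLs again before counting them as missing.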