A focused linked data crawler based on HTML link analysis

Computer and Knowledge Engineering (2014)

Abstract

Linked Data can be published as RDF documents or embedded in HTML documents. A linked data crawler is a program that discovers published linked data on the web by following RDF links. Some RDF documents, however, are surrounded by HTML documents and can be reached only through HTML links. Linked data crawlers therefore need to follow HTML links in addition to RDF links, both to discover such RDF documents and to harvest the linked data embedded in HTML documents. Yet many HTML documents neither embed any linked data nor point to any RDF documents. Crawling such HTML documents lowers the discovery rate of RDF documents per unit of network bandwidth and wastes computational resources on non-RDF documents. In this paper, a focused linked data crawler is proposed to address this problem. The proposed crawler analyzes and prioritizes HTML links by estimating the likelihood that a link will lead to an RDF document. The experimental evaluation shows that the proposed approach increases the discovery rate of RDF documents compared with a non-focused linked data crawler.
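The link prioritization described in the abstract can be pictured as a crawl frontier ordered by an estimated chance of reaching RDF. The sketch below is only an illustration: the abstract does not state which link features the paper uses, so the scoring function `rdf_likelihood`, the hint lists `RDF_EXTENSIONS` and `RDF_KEYWORDS`, the weights, and the `Frontier` class are all hypothetical stand-ins rather than the authors' method.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical hints; the features used in the paper are not given in the abstract.
RDF_EXTENSIONS = (".rdf", ".ttl", ".nt", ".n3", ".owl", ".jsonld")
RDF_KEYWORDS = ("rdf", "foaf", "sparql", "ontology", "linked data")


def rdf_likelihood(url, anchor_text=""):
    """Return a heuristic score in [0, 1] that a link leads to an RDF document."""
    path = urlparse(url).path.lower()
    if path.endswith(RDF_EXTENSIONS):
        return 1.0  # a known RDF file extension is treated as the strongest signal
    text = (path + " " + anchor_text).lower()
    hits = sum(1 for keyword in RDF_KEYWORDS if keyword in text)
    return min(0.9, 0.3 * hits)  # assumed weights, capped below the extension score


class Frontier:
    """Queue of unvisited links that always yields the highest-scoring one first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores are popped in insertion order

    def push(self, url, anchor_text=""):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        score = rdf_likelihood(url, anchor_text)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A non-focused crawler would correspond to a plain first-in, first-out queue in place of `Frontier`. With the priority queue, links that look unlikely to lead to RDF are still kept but are fetched last, so a limited bandwidth budget is spent on the most promising links first.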

Keywords

hypermedia markup languages, search engines, HTML document crawling, HTML link analysis, HTML link prioritization, RDF document discovery rate, RDF links, computation resource wastage, embedded linked data, focused linked data crawler, network bandwidth, non-RDF documents, published linked data discovery, HTML link, RDF link, discovery rate, focused crawler, linked data crawler
