Content extraction using diverse feature sets

WWW (Companion Volume) (2013)

Abstract

The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results that show combining feature sets generally improves performance; and b) a method for including semantic information via id and class attributes applicable to HTML5. We also show that performance decreases on a new benchmark data set that better represents modern chrome.
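The abstract does not say how the id and class attributes are encoded as features, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes that each text node is treated as a block and that the block's semantic features are the lowercase word tokens found in the id and class attributes of its enclosing elements. The names `attribute_tokens`, `BlockFeatureParser`, and `extract_block_features` are hypothetical, and only Python's standard `html.parser` module is used.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def attribute_tokens(attrs):
    """Split id and class values into lowercase alphanumeric tokens."""
    tokens = []
    for name, value in attrs:
        if name in ("id", "class") and value:
            tokens.extend(re.findall(r"[a-z0-9]+", value.lower()))
    return tokens


class BlockFeatureParser(HTMLParser):
    """Collects (text, tokens) pairs, one per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, tokens) for every currently open element
        self.blocks = []  # (text, sorted unique tokens of all ancestors)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, attribute_tokens(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            tokens = sorted({t for _, toks in self.stack for t in toks})
            self.blocks.append((text, tokens))


def extract_block_features(html):
    """Return a list of (text, id/class tokens) pairs for an HTML string."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

In a pipeline like the one the abstract describes, token lists of this kind could be turned into a bag-of-words vector and concatenated with other feature sets before training a classifier. That combination step is an assumption here and is not shown.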

Keywords

main contribution, navigation chrome, diverse feature set, performance decrease, copyright notice, modern chrome, main content, boilerplate detection, content extraction, advertising block
