Web Page Segmentation Revisited: Evaluation Framework and Dataset

CIKM '20: The 29th ACM International Conference on Information and Knowledge Management Virtual Event Ireland October, 2020(2020)

引用 9|浏览23
暂无评分
摘要
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450~crowdsourced segmentations for 8,490~web pages, outranging existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model'' of human annotators: Among other things we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要