DOM2R-Graph: A Web Attribute Extraction Architecture with Relation-Aware Heterogeneous Graph Transformer.

ICONIP (1) (2022)

Web attribute extraction refers to extracting structured entities with specific attributes (e.g., title, director, genre, and MPAA rating for a movie) from HTML documents. Since each part of a web page corresponds to a unique node in the DOM tree, most existing methods formulate web attribute extraction as a multi-class classification task over DOM tree nodes. However, they rarely consider the multiple structural relations between DOM tree nodes, which influence the semantic interactions between nodes. In this paper, we propose a novel web attribute extraction architecture called DOM2R-Graph, which integrates both node semantic information and the heterogeneous structural information of the DOM tree. Specifically, we first construct a heterogeneous graph by connecting DOM tree nodes and their contexts with edges indicating structural relations. Then, we propose a Relation-aware Heterogeneous Graph Transformer (RHGT) to effectively capture the heterogeneous features of structural relations and learn fine-grained representations of nodes on the graph. Extensive experiments on the public SWDE dataset show that DOM2R-Graph outperforms state-of-the-art methods.
Web information extraction, Structured data extraction, Heterogeneous graph transformer
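
The graph construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list which structural relations DOM2R-Graph uses, so the three relation names here (parent_of, child_of, next_sibling), the DomGraphBuilder class, the build_graph helper, and the sample HTML are all assumptions chosen for illustration. The sketch parses a page with Python's standard html.parser, creates one graph node per DOM element, and emits typed edges that a relation-aware model could consume.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DomGraphBuilder(HTMLParser):
    """Builds a typed-edge graph from a DOM tree (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.nodes = []       # one dict per DOM element: id, tag, text
        self.edges = []       # (source id, relation name, target id)
        self._stack = []      # ids of currently open elements
        self._children = {}   # parent id -> ordered list of child ids

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append({"id": node_id, "tag": tag, "text": ""})
        self._children[node_id] = []
        if self._stack:
            parent = self._stack[-1]
            siblings = self._children[parent]
            # Assumed relation types; the paper may define others.
            self.edges.append((parent, "parent_of", node_id))
            self.edges.append((node_id, "child_of", parent))
            if siblings:
                self.edges.append((siblings[-1], "next_sibling", node_id))
            siblings.append(node_id)
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach visible text to the innermost open element.
        text = data.strip()
        if text and self._stack:
            node = self.nodes[self._stack[-1]]
            node["text"] = (node["text"] + " " + text).strip()


def build_graph(html):
    """Return (nodes, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


# Hypothetical movie page fragment, made up for this example.
SAMPLE_HTML = (
    "<html><body><div>"
    "<h1>Inception</h1>"
    "<span>Director</span><span>Christopher Nolan</span>"
    "</div></body></html>"
)

nodes, edges = build_graph(SAMPLE_HTML)
for src, relation, dst in edges:
    print(nodes[src]["tag"], relation, nodes[dst]["tag"])
```

Under the node classification formulation mentioned in the abstract, each entry in nodes would receive one label (for example director, or none), and the relation name on each edge is what lets a model treat a parent link differently from a sibling link.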