Exploiting Collective Hidden Structures In Webpage Titles For Open Domain Entity Extraction

WWW '15: 24th International World Wide Web Conference Florence Italy May, 2015(2015)

引用 7|浏览61
暂无评分
摘要
We present a novel method for open domain named entity extraction by exploiting the collective hidden structures in webpage titles. Our method uncovers the hidden textual structures shared by sets of webpage titles based on generalized URL patterns and a multiple sequence alignment technique. The highlights of our method include: 1) The boundaries of entities can be identified automatically in a collective way without any manually designed pattern, seed or class name. 2) The connections between entities are also discovered naturally based on the hidden structures, which makes it easy to incorporate distant or weak supervision. The experiments show that our method can harvest large scale of open domain entities with high precision. A large ratio of the extracted entities are long-tailed and complex and cover diverse topics. Given the extracted entities and their connections, we further show the effectiveness of our method in a weakly supervised setting. Our method can produce better domain specific entities in both precision and recall compared with the state-of-the-art approaches.
更多
查看译文
关键词
Open domain entity extraction,Multiple sequence alignment,URL pattern,Template generation,Weak supervision
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要