Semantically-Guided Clustering Of Text Documents Via Frequent Subgraphs Discovery
ISMIS'11: Proceedings of the 19th International Conference on Foundations of Intelligent Systems (2011)
Abstract
In this paper we introduce and analyze two improvements to GDClust [1], a system for document clustering based on the co-occurrence of frequent subgraphs. GDClust (Graph-Based Document Clustering) works with frequent senses derived from the constraints of natural language, rather than with the co-occurrences of frequent keywords commonly used in the vector space model (VSM) of document clustering. Text documents are transformed into hierarchical document-graphs, and an efficient graph-mining technique is used to find frequent subgraphs. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. The two novel mechanisms, the Subgraph-Extension Generator (SEG) and the Maximum Subgraph-Extension Generator (MaxSEG), directly use constraints from natural language to reduce the number of candidates and the overhead imposed by our first implementation of GDClust.
Keywords
graph-based data mining, text clustering, clustering with semantic constraints
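
The pipeline summarized in the abstract (document-graphs, frequent subgraph discovery, sense-based similarity) can be illustrated with a small Python sketch. This is not the GDClust implementation and it does not reproduce SEG or MaxSEG: the toy `HYPERNYM` hierarchy, the function names, the edge-set representation of a subgraph, and the Jaccard-style similarity are all assumptions made for illustration. The only idea borrowed from the abstract is that a candidate is grown from an already frequent subgraph by one connected edge, so that disconnected or infrequent candidates are never generated.

```python
# Hypothetical toy concept hierarchy (child -> parent), invented for this sketch.
HYPERNYM = {
    "dog": "animal", "cat": "animal", "animal": "entity",
    "car": "vehicle", "bus": "vehicle", "vehicle": "entity",
}


def document_graph(keywords):
    """Return the (child, parent) edges on every path from a keyword to the root."""
    edges = set()
    for word in keywords:
        while word in HYPERNYM:
            edges.add((word, HYPERNYM[word]))
            word = HYPERNYM[word]
    return frozenset(edges)


def support(subgraph, graphs):
    """Count the document-graphs that contain every edge of the subgraph."""
    return sum(1 for g in graphs if subgraph <= g)


def frequent_subgraphs(graphs, min_support):
    """Level-wise search: grow each frequent subgraph by one adjacent frequent edge."""
    all_edges = set().union(*graphs)
    frequent_edges = {
        e for e in all_edges if support(frozenset([e]), graphs) >= min_support
    }
    level = {frozenset([e]) for e in frequent_edges}
    result = set(level)
    while level:
        next_level = set()
        for sub in level:
            nodes = {n for edge in sub for n in edge}
            for e in frequent_edges - sub:
                # Only extensions that stay connected to the subgraph are considered.
                if e[0] in nodes or e[1] in nodes:
                    cand = sub | {e}
                    if cand not in result and support(cand, graphs) >= min_support:
                        next_level.add(cand)
        result |= next_level
        level = next_level
    return result


def similarity(g1, g2, subgraphs):
    """Jaccard overlap of the frequent subgraphs contained in two document-graphs."""
    s1 = {s for s in subgraphs if s <= g1}
    s2 = {s for s in subgraphs if s <= g2}
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


docs = {"d1": ["dog", "cat"], "d2": ["dog"], "d3": ["car", "bus"], "d4": ["bus"]}
graphs = {name: document_graph(words) for name, words in docs.items()}
subgraphs = frequent_subgraphs(list(graphs.values()), min_support=2)
print(similarity(graphs["d1"], graphs["d2"], subgraphs))
print(similarity(graphs["d1"], graphs["d3"], subgraphs))
```

In this toy corpus the two animal documents share all of their frequent subgraphs, while an animal document and a vehicle document share none, so a standard clustering algorithm run on such similarities would separate the two groups.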