How to Select Samples for Active Learning? Document Clustering with Active Learning Methodology
IEEE International Conference on Engineering of Complex Computer Systems (2023)
Abstract
In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling tasks. These tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach using automatic annotation from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small number of carefully selected examples can improve the quality of clustering relative to approaches based on a general notion of text similarity. We compare several techniques for selecting samples for labeling, dimensionality reduction methods, and training approaches to obtain the best quality of the resulting clusters with a minimum number of annotations. The results show that a relatively simple approach can yield good-quality clusters and thus support developing classification ontologies in a data-centric way.
Keywords
active learning, document clustering, natural language processing
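
The simulated setup described in the abstract (an oracle that answers from prepared labels, and a selection rule that picks which examples to annotate next) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: the 2-D points stand in for reduced document embeddings, the model is a nearest-centroid classifier, and the selection rule is margin-based uncertainty sampling. The abstract does not say which selection techniques or models were actually compared.

```python
import math
import random


def fit_centroids(X, labeled):
    """Mean vector of the labeled points for each class (hypothetical model)."""
    groups = {}
    for i, label in labeled.items():
        groups.setdefault(label, []).append(X[i])
    return {
        label: [sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0]))]
        for label, pts in groups.items()
    }


def margin(x, centroids):
    """Gap between the two nearest centroids; a small gap means high uncertainty."""
    dists = sorted(math.dist(x, c) for c in centroids.values())
    return dists[1] - dists[0]


def simulate_active_learning(X, y_oracle, seed_idx, budget):
    """Query `budget` labels from the prepared labels, most uncertain point first.

    Assumes `seed_idx` contains at least one example of two different classes,
    otherwise the margin is undefined.
    """
    labeled = {i: y_oracle[i] for i in seed_idx}
    for _ in range(budget):
        centroids = fit_centroids(X, labeled)
        pool = [i for i in range(len(X)) if i not in labeled]
        if not pool:
            break
        query = min(pool, key=lambda i: margin(X[i], centroids))
        labeled[query] = y_oracle[query]  # automatic annotation from prepared labels
    return labeled, fit_centroids(X, labeled)


# Toy data: two synthetic blobs standing in for two document topics.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(20)]
y = [0] * 20 + [1] * 20

labeled, centroids = simulate_active_learning(X, y, seed_idx=[0, 20], budget=5)
print(sorted(labeled))
```

Replacing `margin` with a different scoring function (for example, random selection as a baseline) is the kind of comparison the abstract refers to when it mentions techniques for selecting samples.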