Sifaka: Text Mining Above a Search API

arXiv (Cornell University)（2018）

引用 0|浏览24

暂无评分

摘要

Text mining and analytics software has become popular, but little attention has been paid to the software architectures of such systems. Often they are built from scratch using special-purpose software and data structures, which increases their cost and complexity. This demo paper describes Sifaka, a new open-source text mining application constructed above a standard search engine index using existing application programmer interface (API) calls. Indexing integrates popular annotation software libraries to augment the full-text index with noun phrase and named-entities; n-grams are also provided. Sifaka enables a person to quickly explore and analyze large text collections using search, frequency analysis, and co-occurrence analysis; and import existing document labels or interactively construct document sets that are positive or negative examples of new concepts, perform feature selection, and export feature vectors compatible with popular machine learning software. Sifaka demonstrates that search engines are good platforms for text mining applications while also making common IR text mining capabilities accessible to researchers in disciplines where programming skills are less common.

查看译文

关键词

text mining,search api

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要