Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

PROCEEDINGS OF THE ACM SYMPOSIUM ON DOCUMENT ENGINEERING (DOCENG 2018)(2018)

引用 5|浏览10
暂无评分
摘要
N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要