Unsupervised clustering of file dialects according to monotonic decompositions of mixtures

CoRR (2023)

Abstract
This paper proposes an unsupervised classification method that partitions a set of files into non-overlapping dialects based upon their behaviors, determined by messages produced by a collection of programs that consume them. The pattern of messages can be used as the signature of a particular kind of behavior, with the understanding that some messages are likely to co-occur, while others are not. We propose a novel definition for a file format dialect, based upon these behavioral signatures. A dialect defines a subset of the possible messages, called the required messages. Once files are conditioned upon a dialect and its required messages, the remaining messages are statistically independent. With this definition in hand, we present a greedy algorithm that deduces candidate dialects from a dataset consisting of a matrix of file-message data, demonstrate its performance on several file formats, and prove conditions under which it is optimal. We show that an analyst needs to consider fewer dialects than distinct message patterns, which reduces their cognitive load when studying a complex format.
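As a rough illustration of the data the abstract describes, the sketch below builds a tiny file-message matrix, counts its distinct message patterns, and partitions the files into dialects by their required messages. It is not the paper's greedy algorithm: the matrix, the message names, the two hand-picked dialects, and the assignment rule (choose the dialect with the largest required set fully present in the file) are all assumptions made only for this example.

```python
from collections import Counter

# Hypothetical toy data: rows are files, columns are messages
# (1 = some consuming program produced that message for the file).
MESSAGES = ["m0", "m1", "m2", "m3"]
FILE_MESSAGE_MATRIX = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]

# Hypothetical, hand-picked dialects: name -> indices of required messages.
# The paper deduces these from data; here they are simply given.
DIALECTS = {"A": {0, 1}, "B": {2}}


def message_pattern(row):
    """Return the set of message indices present in one file."""
    return frozenset(i for i, bit in enumerate(row) if bit)


def assign_dialect(row, dialects):
    """Assumed rule: the dialect with the largest required set contained in the file."""
    present = message_pattern(row)
    candidates = [(len(req), name) for name, req in dialects.items() if req <= present]
    return max(candidates)[1] if candidates else None


patterns = Counter(message_pattern(row) for row in FILE_MESSAGE_MATRIX)
assignments = [assign_dialect(row, DIALECTS) for row in FILE_MESSAGE_MATRIX]

print("distinct message patterns:", len(patterns))
print("dialects in use:", len(set(assignments)))
print("assignment per file:", assignments)
```

In this toy input, five distinct message patterns collapse into two non-overlapping dialects, which is the kind of reduction the abstract refers to. Checking that the remaining, non-required messages are statistically independent within each dialect is left out of the sketch.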
Keywords
file-format-dialect,statistical-method,independent-mixture-model