Mining order-preserving submatrices from probabilistic matrices

ACM Trans. Database Syst.(2014)

引用 19|浏览37
暂无评分
摘要
Order-preserving submatrices (OPSMs) capture consensus trends over columns shared by rows in a data matrix. Mining OPSM patterns discovers important and interesting local correlations in many real applications, such as those involving biological data or sensor data. The prevalence of uncertain data in various applications, however, poses new challenges for OPSM mining, since data uncertainty must be incorporated into OPSM modeling and the algorithmic aspects. In this article, we define new probabilistic matrix representations to model uncertain data with continuous distributions. A novel probabilistic order-preserving submatrix (POPSM) model is formalized in order to capture similar local correlations in probabilistic matrices. The POPSM model adopts a new probabilistic support measure that evaluates the extent to which a row belongs to a POPSM pattern. Due to the intrinsic high computational complexity of the POPSM mining problem, we utilize the anti-monotonic property of the probabilistic support measure and propose an efficient Apriori-based mining framework called ProbApri to mine POPSM patterns. The framework consists of two mining methods, UniApri and NormApri, which are developed for mining POPSM patterns, respectively, from two representative types of probabilistic matrices, the UniDist matrix (assuming uniform data distributions) and the NormDist matrix (assuming normal data distributions). We show that the NormApri method is practical enough for mining POPSM patterns from probabilistic matrices that model more general data distributions. We demonstrate the superiority of our approach by two applications. First, we use two biological datasets to illustrate that the POPSM model better captures the characteristics of the expression levels of biologically correlated genes and greatly promotes the discovery of patterns with high biological significance. Our result is significantly better than the counterpart OPSMRM (OPSM with repeated measurement) model which adopts a set-valued matrix representation to capture data uncertainty. Second, we run the experiments on an RFID trace dataset and show that our POPSM model is effective and efficient in capturing the common visiting subroutes among users.
更多
查看译文
关键词
probabilistic matrix,popsm pattern,general data distribution,uncertain data,popsm model,biological data,sensor data,data uncertainty,data matrix,normal data distribution
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要