Parameter free bursty events detection in text streams

VLDB, pp.181-192, (2005)

Cited by 499 | Views 88

Abstract

Text classification is a major data mining task. An advanced text classification technique is known as partially supervised text classification, which can build a text classifier using a small set of positive examples only. This leads to our curiosity whether it is possible to find a set of features that can be used to describe the positi...

Introduction
  • The authors study a new problem, hot bursty events detection in a text stream, where a text stream is a sequence of chronologically ordered documents and a hot bursty event is a minimal set of bursty features that occur together in certain time windows with strong support from documents in the text stream.
  • Hot bursty events detection could possibly be handled by clustering documents and then selecting features from the clusters found.
  • The authors propose a novel feature-pivot clustering approach for hot bursty events detection.
Highlights
  • We study a new problem, called hot bursty events detection in a text stream, where a text stream is a sequence of chronologically ordered documents, and a hot bursty event is a minimal set of bursty features that occur together in certain time windows with strong support from documents in the text stream.
  • With our techniques, users do not even need to specify a set of positive examples to build a text classifier.
  • We studied a new problem, called hot bursty events detection in a sequence of chronologically ordered documents, where a bursty event is a set of bursty features appearing in certain time windows.
  • Except for the first issue, which is the focus of this paper, the other issues have been addressed in recent papers and are known as partially supervised text classification.
  • We proposed a parameter-free probabilistic approach for effectively and efficiently identifying bursty events, called the feature-pivot clustering approach.
Results
  • Section 4 shows that the parameter-free feature-pivot clustering approach can detect the bursty events with a high success rate.
  • The authors address the issues behind the document-pivot clustering approach that make hot bursty events detection difficult.
  • For detecting hot bursty events, the document-pivot clustering approach first assigns weights to the features based on the most widely used tf·idf scheme [15].
  • The tf·idf scheme is not suited to hot bursty events detection, because the goal is to find features that appear in a large number of documents in certain hot periods, so as to distinguish the set of documents that contain the bursty features from the other documents.
  • Suppose that fj is a bursty feature in a bursty event Ek, such that fj appears with high frequency only in the hot periods of Ek. This implies that ni,j is large in the time window Wi where Ek is a bursty event.
  • If a bursty event Ek contains many features that appear in different sets of documents, P(D|Ek) becomes small, which
  • … ∑_{j=0}^{|Bk|} ej, where |Bk| is the number of bursty features in Ek. In this paper, the authors say a bursty event Ek is hot in Wi if Pb(i, Ek) > β, where β is set to two standard deviations above the expected value of Pb(i, Ek) for i = 1, 2, ….
  • The authors concentrate on the novel feature-pivot clustering approach and do not show results for document-pivot clustering, because no reported studies provide enough detail for them to fine-tune the parameters for grouping bursty features.
  • Importantly, no bursty features can be observed using the document-pivot clustering approach if they appear continuously, like Law in the example.
  • The authors studied a new problem, called hot bursty events detection in a sequence of chronologically ordered documents, where a bursty event is a set of bursty features appearing in certain time windows.
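The hotness rule in the list above (Ek is hot in Wi when Pb(i, Ek) exceeds β, with β two standard deviations above the expected value) can be sketched as follows. This is only an illustration under stated assumptions: the function name `hot_windows`, the sample probabilities, the 0-based window indices, and the use of the population standard deviation are all choices made here, not details confirmed by the paper.

```python
from statistics import mean, pstdev


def hot_windows(pb_per_window, k=2.0):
    """Return (beta, indices of windows where P_b(i, E_k) > beta).

    beta is the mean of the per-window values plus k standard deviations.
    Assumption: population standard deviation; indices are 0-based.
    """
    beta = mean(pb_per_window) + k * pstdev(pb_per_window)
    hot = [i for i, p in enumerate(pb_per_window) if p > beta]
    return beta, hot


# Hypothetical burst probabilities of one event E_k over eight windows.
pb = [0.05, 0.04, 0.06, 0.05, 0.95, 0.05, 0.04, 0.06]
beta, hot = hot_windows(pb)
print(round(beta, 3), hot)
```

Because β is derived from the observed values themselves, the rule needs no user-supplied threshold, which matches the parameter-free framing of the approach.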
Conclusion
  • The authors proposed a parameter-free probabilistic approach for effectively and efficiently identifying bursty events, called the feature-pivot clustering approach.
  • The testing results showed that the parameter-free feature-pivot clustering approach can detect the bursty events with a high success rate.
  • Importantly, this is achieved without parameter tuning or estimation.
Related Work
  • Topic detection and tracking (TDT) is the major area that tackles the problem of discovering events from a stream of news stories [2, 18, 27, 26, 3, 4, 27, 26, 21]. These works all use similar techniques for event detection, namely clustering similar documents together to form events. We discussed in Section 2 that this approach cannot be directly applied to our hot bursty events detection. In addition to the quality issue of whether it can find bursty events, there is an efficiency issue: the size of the corpus usually makes the clustering problem difficult. The work in [21, 27, 26] attempted to improve the efficiency of clustering, but it introduces further parameters to be tuned.
Subjects and Analysis
consecutive documents: 8
Figure 1 illustrates an example. Suppose there are eight consecutive documents, A to H, where documents A, D, E, G and H support the same event X. During document clustering, suppose that documents A, D and G are initially grouped together in a cluster, G1, and documents E and H are initially grouped together in another cluster, G2.

cases: 3
Let Pb(i, fj) be the probability that the feature fj is bursty in the time window Wi. We consider three cases below. • When ni,j is in RA, it implies that Po(ni,j) ≤ pj

cases: 3
We consider fj as a bursty feature in Wi, and let Pb(i, fj) = 1. • When ni,j is in RB, there are three further cases. When ni,j approaches the boundary of RB and RC, the corresponding feature fj will be a bursty feature; when ni,j approaches the boundary of RB and RA, fj will be a non-bursty feature; and when ni,j is at the mid-point of region RB (the point q in Figure 3), fj can be either bursty or not bursty.
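The three regions can be pictured as a piecewise function of the count ni,j. The sketch below is hypothetical in several respects: the excerpt does not say how the region boundaries are derived, so `lo` and `hi` are plain inputs; it assumes RA maps to probability 0 and that the "Pb(i, fj) = 1" case refers to RC; and the logistic curve inside RB is one assumed way to get a value near 0 at the RA side, near 1 at the RC side, and exactly 0.5 at the mid-point q.

```python
import math


def burst_probability(n, lo, hi, steepness=10.0):
    """Piecewise burst probability for a feature count n in one window.

    lo: hypothetical boundary between RA and RB.
    hi: hypothetical boundary between RB and RC.
    steepness: assumed shape parameter of the logistic curve inside RB.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if n <= lo:   # region RA: treated as not bursty (assumption)
        return 0.0
    if n >= hi:   # region RC: treated as bursty
        return 1.0
    q = (lo + hi) / 2.0          # mid-point of RB
    x = (n - q) / (hi - lo)      # rescaled position, between -0.5 and 0.5
    return 1.0 / (1.0 + math.exp(-steepness * x))


print(burst_probability(5, 10, 20))   # in RA
print(burst_probability(15, 10, 20))  # at the mid-point q
print(burst_probability(25, 10, 20))  # in RC
```

With this assumed curve the value jumps slightly at the two boundaries; a larger `steepness` makes those jumps smaller at the cost of a sharper transition around q.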

documents: 153
• Grouping bursty features Sars and Iraq: The total numbers of documents that contain the bursty features Sars and Iraq during the bursty period are |DSars| = 3,240 and |DIraq| = 2,404, respectively. In total, 153 documents report both events at the same time, i.e. |DSars ∩ DIraq| = 153, and 5,491 documents contain either Sars or Iraq, i.e. |M| = |DSars ∪ DIraq| = 5,491 (Eq. (11)). Consider whether Sars and Iraq shall be grouped

documents: 1854
• Grouping Sars and Outbreak: The total numbers of documents that contain the bursty features Sars and Outbreak during the bursty period are |DSars| = 3,240 and |DOutbreak| = 2,254, respectively. In total, 1,854 documents report both events at the same time, i.e. |DSars ∩ DOutbreak| = 1,854, and 3,640 documents contain either Sars or Outbreak, i.e. |M| = |DSars ∪ DOutbreak| = 3,640 (Eq. (11)). If Sars and Outbreak are grouped together, P(D|Ek) = (3240/3640) × (2254/3640) = 0.551.
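The arithmetic in the two grouping examples can be checked with a few lines of Python. The helper names `union_size` and `group_support` are made up for this sketch, and applying the same product formula to the Sars and Iraq pair is an assumption made here, since the excerpt breaks off before stating that result.

```python
def union_size(a, b, both):
    """|A ∪ B| from the two set sizes and the size of their overlap."""
    return a + b - both


def group_support(doc_counts, m):
    """Product of |D_f| / |M| over the features of a candidate group."""
    p = 1.0
    for count in doc_counts:
        p *= count / m
    return p


# Document counts quoted in the text.
m_outbreak = union_size(3240, 2254, 1854)   # Sars ∪ Outbreak
m_iraq = union_size(3240, 2404, 153)        # Sars ∪ Iraq

p_outbreak = group_support([3240, 2254], m_outbreak)
p_iraq = group_support([3240, 2404], m_iraq)

print(m_outbreak, round(p_outbreak, 3))  # 3640 0.551
print(m_iraq, round(p_iraq, 3))
```

Features that co-occur in most of the same documents keep |M| small and the product high, whereas features with little overlap inflate |M| and push the product down.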

people: 300000
The bursty feature Rally is similar to the bursty feature Article, because demonstrations were usually associated with rallies. The major difference between the feature distributions of Rally and Article is that Rally has a different burst period on 2nd July 2004, because on 1st July 2004 there was another massive demonstration, which included over 300,000 people. In short, all the bursty features are strongly interrelated with each other.

References
  • R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB’95), 1995.
  • J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’98), 1998.
  • T. Brants and F. Chen. A system for new event detection. In Proceedings of the 26th ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’03), 2003.
  • M. Connell, A. Feng, G. Kumaran, H. Raghavan, C. Shah, and J. Allan. UMass at TDT 2004. In 2004 Topic Detection and Tracking Workshop (TDT’04), 2004.
  • C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data (SIGMOD’94), 1994.
  • G. P. C. Fung, J. X. Yu, H. Lu, and P. S. Yu. Text classification without negative labeled examples. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), 2005.
  • G. J. F. Jones and S. M. Gabb. A visualisation tool for topic tracking analysis and development. In Proceedings of the 25th ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’02), 2002.
  • E. J. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD’97), 1997.
  • J. M. Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD’02), 2002.
  • X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In Proceedings of the 2003 International Joint Conference on Artificial Intelligence (IJCAI’03), 2003.
  • N. E. Miller, P. C. Wong, M. Brewster, and H. Foote. Topic islands – a wavelet-based text visualization system. In Proceedings of the 9th IEEE Visualization, 1998.
  • D. C. Montgomery and G. C. Runger. Applied Statistics and Probability for Engineers. John Wiley & Sons, Inc., second edition, 1999.
  • S. Morinaga and K. Yamanishi. Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD’04), 2004.
  • R. Papka and J. Allan. On-line new event detection using single pass clustering. Technical Report IR–123, Department of Computer Science, University of Massachusetts, 1998.
  • G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
  • F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
  • D. A. Smith. Detecting and browsing events in unstructured text. In Proceedings of the 25th ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’02), 2002.
  • M. Spitters and W. Kraaij. TNO at TDT2001: Language model-based topic detection. In 2001 Topic Detection and Tracking Workshop (TDT’01), 2001.
  • R. C. Swan and J. Allan. Extracting significant time varying features from text. In Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM’98), 1998.
  • R. C. Swan and J. Allan. Automatic generation of overview timelines. In Proceedings of the 23rd ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’00), 2000.
  • D. Trieschnigg and W. Kraaij. Hierarchical topic detection in large digital news archives. In Proceedings of the 5th Dutch Belgian Information Retrieval Workshop, 2005.
  • M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD’04), 2004.
  • P. Willett. Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24(5):577–597, 1988.
  • P. C. Wong, W. Cowley, H. Foote, E. Jurrus, and J. Thomas. Visualizing sequential patterns for text mining. In Proceedings of the 2000 IEEE Symposium on Information Visualization, 2000.
  • Y. Yang, T. Ault, T. Pierce, and C. W. Lattimer. A study on thresholding strategies for text categorization. In Proceedings of the 23rd ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’00), 2000.
  • Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. T. Archibald, and X. Liu. Learning approaches for detecting and tracking news events. IEEE Intelligent Systems, 14(4):32–43, 1999.
  • Y. Yang, T. Pierce, and J. Carbonell. A study on retrospective and on-line event detection. In Proceedings of the 21st ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR’98), 1998.