
Spam Filtering with Naive Bayes - Which Naive Bayes?

CEAS (2006)


Abstract

Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets that contain ham messages of particular En…


Introduction
  • Several machine learning algorithms have been employed in anti-spam e-mail filtering, including algorithms that are considered top performers in text classification, like Boosting and Support Vector Machines; nevertheless, Naive Bayes classifiers currently appear to be popular in commercial and open-source spam filters.
  • In further work on text classification, which included experiments on Ling-Spam, Schneider [25] found that the multinomial nb surprisingly performs even better when term frequencies are replaced by Boolean attributes
Highlights
  • Several machine learning algorithms have been employed in anti-spam e-mail filtering, including algorithms that are considered top performers in text classification, like Boosting and Support Vector Machines; nevertheless, Naive Bayes classifiers currently appear to be popular in commercial and open-source spam filters
  • There are, however, several forms of nb classifiers, and the anti-spam literature does not always acknowledge this. In their seminal papers on learning-based spam filters, Sahami et al. [21] used a nb classifier with a multi-variate Bernoulli model, a form of nb that relies on Boolean attributes, whereas Pantel and Lin [19] in effect adopted the multinomial form of nb, which normally takes into account term frequencies
  • In order to evaluate the different nb versions across the entire tradeoff between true positives and true negatives, we present the evaluation results by means of roc curves, plotting sensitivity against 1− specificity
  • The most interesting result of our evaluation was the very good performance of the two nb versions that have been used less in spam filtering, i.e., fb and the multinomial nb with Boolean attributes; these two versions collectively obtained the best results in our experiments
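The contrast between the multinomial nb with term-frequency attributes and the same model with Boolean attributes can be made concrete with a small sketch. The toy messages and the pure-Python implementation below are our own illustration, not the paper's code (the actual experiments use the EnronSpam datasets); the only difference between the two variants is whether term frequencies are kept or binarized before training and scoring.

```python
import math
from collections import defaultdict

def train_multinomial(docs, labels, vocab):
    """Multinomial NB: per-class term counts, Laplace-smoothed at scoring time."""
    count = {c: defaultdict(int) for c in set(labels)}
    total = defaultdict(int)
    prior = defaultdict(int)
    for doc, c in zip(docs, labels):
        prior[c] += 1
        for t, f in doc.items():
            count[c][t] += f
            total[c] += f
    return count, total, prior, vocab

def log_posterior_multinomial(model, doc):
    count, total, prior, vocab = model
    n = sum(prior.values())
    scores = {}
    for c in prior:
        s = math.log(prior[c] / n)
        for t, f in doc.items():
            p = (count[c][t] + 1) / (total[c] + len(vocab))
            s += f * math.log(p)  # f is a term frequency, or 1 if Boolean
        scores[c] = s
    return scores

# Toy messages as term-frequency dicts (hypothetical data).
docs = [{'cheap': 2, 'pills': 1}, {'cheap': 1, 'meds': 1},
        {'meeting': 1, 'notes': 2}, {'project': 1, 'meeting': 1}]
labels = ['spam', 'spam', 'ham', 'ham']
vocab = {t for d in docs for t in d}

tf_model = train_multinomial(docs, labels, vocab)
# Multinomial NB with Boolean attributes: binarize frequencies first.
bool_docs = [{t: 1 for t in d} for d in docs]
bool_model = train_multinomial(bool_docs, labels, vocab)

msg = {'cheap': 2, 'pills': 1}
tf_scores = log_posterior_multinomial(tf_model, msg)
bool_scores = log_posterior_multinomial(bool_model, {t: 1 for t in msg})
print(max(tf_scores, key=tf_scores.get),
      max(bool_scores, key=bool_scores.get))  # both classify this message as spam
```

The multi-variate Bernoulli model evaluated in the paper differs further in that it also scores the attributes that are *absent* from a message; the sketch above covers only the two multinomial variants.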
Results
  • 4.1 Size of attribute set

    The authors first examined the impact of the number of attributes on the effectiveness of the five nb versions. As mentioned above, the authors experimented with 500, 1000, and 3000 attributes.
  • The differences in effectiveness across different numbers of attributes are rather insignificant.
  • Tables 2 and 3 show the maximum differences in spam and ham recall, respectively, across the three sizes of the attribute set, for each nb version and dataset, with T = 0.5; note that the differences are in percentage points.
  • In operational filters the differences in effectiveness may not justify the increased computational cost that larger attribute sets require, even though the increase in computational cost is linear in the number of attributes
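The "maximum difference" figures of Tables 2 and 3 are simply the spread of recall values measured with the three attribute-set sizes. A minimal sketch, using hypothetical recall values (the real values are in the tables, which are not reproduced here):

```python
# Hypothetical spam-recall values (%) for one nb version on one dataset,
# measured with 500, 1000, and 3000 attributes at threshold T = 0.5.
recall_by_size = {500: 95.1, 1000: 95.6, 3000: 95.9}

# Maximum difference across attribute-set sizes, in percentage points,
# which is what Tables 2 and 3 report (x100).
max_diff = max(recall_by_size.values()) - min(recall_by_size.values())
print(round(max_diff, 1))  # 0.8 percentage points for these made-up values
```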
Conclusion
  • CONCLUSIONS AND FURTHER WORK

    The authors discussed and evaluated experimentally, in a spam filtering context, five different versions of the Naive Bayes classifier.
  • The authors' investigation included two versions of nb that have not been used widely in the spam filtering literature, namely Flexible Bayes and the multinomial nb with Boolean attributes.
  • The most interesting result of the evaluation was the very good performance of the two nb versions that have been used less in spam filtering, i.e., fb and the multinomial nb with Boolean attributes; these two versions collectively obtained the best results in the experiments.
  • The best results in terms of effectiveness were generally achieved with the largest attribute set (3000 attributes), as one might have expected, but the gain was rather insignificant, compared to smaller and computationally cheaper attribute sets
Tables
  • Table1: Composition of the six benchmark datasets
  • Table2: Maximum difference (×100) in spam recall across 500, 1000, 3000 attributes for T = 0.5
  • Table3: Maximum difference (×100) in ham recall across 500, 1000, 3000 attributes for T = 0.5
  • Table4: Spam recall (%) for 3000 attributes, T = 0.5
  • Table5: Ham recall (%) for 3000 attributes, T = 0.5
Funding
  • The authors found that fb clearly outperforms the multi-variate Gauss nb on the pu corpora, when the attributes are term frequencies divided by document lengths, but did not compare fb against the other nb versions
  • The paper sheds more light on the five versions of nb mentioned above, and evaluates them experimentally on six new, non-encoded datasets, collectively called EnronSpam, which the authors make publicly available
Study subjects and analysis
Enron employees: 150
However, the loss of the original tokens still imposes restrictions; for example, it is impossible to experiment with different tokenizers. Following the Enron investigation, the personal files of approximately 150 Enron employees were made publicly available. The files included a large number of personal e-mail messages, which have been used to create e-mail classification benchmarks [3, 15], including a public benchmark corpus for the trec 2005 Spam Track.

Enron employees: 6
Hence, the experiments corresponded to the scenario where a single filter is trained on a collection of messages received by many different users, as opposed to using personalized filters. As we were more interested in personalized spam filters, we focussed on six Enron employees who had large mailboxes

datasets: 6
The total number of messages in each dataset is between five and six thousand. The six datasets emulate different situations faced by real users, allowing us to obtain a more complete picture of the performance of learning-based filters. Table 1 summarizes the characteristics of the six datasets

datasets: 6
The six datasets emulate different situations faced by real users, allowing us to obtain a more complete picture of the performance of learning-based filters. Table 1 summarizes the characteristics of the six datasets. Hereafter, we refer to the first, second, . . . , sixth dataset of Table 1 as Enron1, Enron2, . . . , Enron6, respectively

datasets: 6
Hereafter, we refer to the first, second, . . . , sixth dataset of Table 1 as Enron1, Enron2, . . . , Enron6, respectively. In addition to what was mentioned above, the six datasets were subjected to the following pre-processing steps. First, we removed messages sent by the owner of the mailbox (we checked if the address of the owner appeared in the ‘To:’, ‘Cc:’, or ‘Bcc:’ fields), since we believe e-mail users are increasingly adopting better ways to keep copies of outgoing messages

datasets: 6
Hence, an incremental retraining and evaluation procedure that also takes into account the characteristics of spam that vary over time is essential when comparing different learning algorithms in spam filtering. In order to realize this incremental procedure with the use of our six datasets, we needed to order the messages of each dataset in a way that preserves the original order of arrival of the messages in each category; i.e., each spam message must be preceded by all spam messages that arrived earlier, and the same applies to ham messages. The varying ham-spam ratio over time also had to be emulated. (The reader is reminded that the spam and ham messages of each dataset are from different time periods.)
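The incremental retraining-and-evaluation loop described above can be outlined as follows. The classifier here is a trivial placeholder (majority vote), purely to show the train-on-the-past, classify-the-next-message structure; in the paper's experiments each of the five nb versions takes its place, and the seed size is our own assumption.

```python
def incremental_evaluation(messages, train, classify):
    """messages: list of (features, label) pairs in order of arrival.
    After an initial seed batch, each incoming message is first classified
    by a filter trained on all earlier messages, then added to the
    training set (incremental retraining)."""
    seed = 10  # hypothetical size of the initial training batch
    correct = 0
    for i in range(seed, len(messages)):
        model = train(messages[:i])  # retrain on all messages seen so far
        x, y = messages[i]
        if classify(model, x) == y:
            correct += 1
    return correct / (len(messages) - seed)

# Placeholder classifier: predict the majority label seen so far.
def train(history):
    labels = [y for _, y in history]
    return max(set(labels), key=labels.count)

def classify(model, x):
    return model

# Toy ordered stream: 12 ham messages followed by 4 spam messages.
stream = [({}, 'ham')] * 12 + [({}, 'spam')] * 4
accuracy = incremental_evaluation(stream, train, classify)
print(accuracy)  # 2 of the 6 post-seed messages are classified correctly
```

Because retraining happens after every message, the filter always reflects the mailbox's history up to that point, which is what lets the evaluation track the varying ham-spam ratio over time.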

explains how the datasets: 3
Section 2 below presents the event models and assumptions of the nb versions we considered. Section 3 explains how the datasets of our experiments were assembled and the evaluation methodology we used; it also highlights some pitfalls that have to be avoided when constructing spam filtering benchmarks. Section 4 then presents and discusses our experimental results

Enron employees: 150
However, the loss of the original tokens still imposes restrictions; for example, it is impossible to experiment with different tokenizers. Following the Enron investigation, the personal files of approximately 150 Enron employees were made publicly available.8. The files included a large number of personal e-mail messages, which have been used to create e-mail classification benchmarks [3, 15], including a public benchmark corpus for the trec 2005 Spam Track.9

Enron employees: 6
Hence, the experiments corresponded to the scenario where a single filter is trained on a collection of messages received by many different users, as opposed to using personalized filters. As we were more interested in personalized spam filters, we focussed on six Enron employees who had large mail-. 6The SpamAssassin corpus and Spambase are available from http://www.spamassassin.org/publiccorpus/ and http://www.ics.uci.edu/∼mlearn/MLRepository.html. 7See http://www.ecmlpkdd2006.org/challenge.html. 8See http://fercic.aspensys.com/members/manager.asp. 9Consult http://plg.uwaterloo.ca/ gvcormac/spam/ for further details

datasets: 6
There does not seem to be a clear justification for these differences, in terms of the ham-spam ratio or the spam source used in each dataset. Despite its theoretical association to term frequencies, in all six datasets the multinomial nb seems to be doing better when Boolean attributes are used, which agrees with Schneider’s observations [25]. The difference, however, is in most cases very small and not always statistically significant; it is clearer in the first dataset and, to a lesser extent, in the last one

datasets: 6
The difference, however, is in most cases very small and not always statistically significant; it is clearer in the first dataset and, to a lesser extent, in the last one. Furthermore, the multinomial nb with Boolean attributes seems to be the best performer in 4 out of 6 datasets, although again by a small and not always statistically significant margin, and it is clearly outperformed only by fb in the other 2 datasets. This is particularly interesting, since many nb-based spam filters appear to adopt the multinomial nb with tf attributes or the multi-variate Bernoulli nb (which uses Boolean attributes); the latter seems to be the worst among the nb versions we evaluated

datasets: 2
The fb classifier shows signs of impressive superiority in Enron1 and Enron2, and its performance is almost indistinguishable from that of the top performers in Enron5 and Enron6. However, it does not perform equally well, compared to the top performers, in the other two datasets (Enron3, Enron4), which strangely include what appears to be the easiest dataset (Enron4). One problem we noticed with fb is that its estimates for p(c | x) are very close to 0 or 1; hence, varying the threshold T has no effect on the classification of many messages
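fb's saturated estimates of p(c | x) are easy to reproduce. Flexible Bayes models each real-valued attribute per class as an average of Gaussian kernels centred at the training values [14]; multiplying many per-attribute densities drives the posterior numerically to 0 or 1, so moving the threshold T reclassifies almost nothing. A toy sketch (the data, the number of attributes, and the fixed bandwidth are our own simplifying assumptions):

```python
import math

def kde_log_density(x, centers, bandwidth):
    """Flexible Bayes: class-conditional density of one attribute as an
    average of Gaussian kernels centred at the training values."""
    norm = 1.0 / (bandwidth * math.sqrt(2 * math.pi))
    dens = sum(norm * math.exp(-0.5 * ((x - c) / bandwidth) ** 2)
               for c in centers) / len(centers)
    return math.log(max(dens, 1e-300))  # guard against underflow

def posterior_spam(x_vec, spam_train, ham_train, bandwidth=0.1):
    log_s = sum(kde_log_density(x, col, bandwidth)
                for x, col in zip(x_vec, spam_train))
    log_h = sum(kde_log_density(x, col, bandwidth)
                for x, col in zip(x_vec, ham_train))
    # Equal priors; convert the log-odds to a probability.
    return 1.0 / (1.0 + math.exp(min(log_h - log_s, 700)))

# Toy per-attribute training values (one list per attribute, per class).
n_attrs = 50
spam_train = [[0.8, 0.9, 1.0]] * n_attrs
ham_train = [[0.0, 0.1, 0.2]] * n_attrs

p = posterior_spam([0.85] * n_attrs, spam_train, ham_train)
print(p)  # saturates numerically at 1.0: each attribute's log-odds add up
```

With 50 attributes each contributing a large log-odds term, the summed log-odds are in the hundreds, so the posterior is indistinguishable from 1 in floating point; any threshold T strictly between 0 and 1 yields the same decision.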

datasets: 6
Taking into account its smoother trade-off between ham and spam recall, and its better computational complexity at run time, we tend to prefer the multinomial nb with Boolean attributes over fb, but further experiments are necessary to establish its superiority with confidence. For completeness, Tables 4 and 5 list the spam and ham recall, respectively, of the nb versions on the 6 datasets for T = 0.5, although comparing at a fixed threshold T is not particularly informative; for example, two methods may obtain the same results at different thresholds. On average, the multinomial nb with Boolean attributes again has the best results, both in spam and ham recall
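The remark about fixed thresholds is the reason for the roc presentation: an roc curve is traced by sweeping T over the classifier's scores and recording sensitivity (spam recall) against 1 − specificity at each setting, so two methods can be compared across all operating points rather than at one arbitrary T. A minimal sketch with hypothetical spam-probability scores:

```python
def roc_points(scores, labels):
    """Sweep the classification threshold T over the observed scores and
    return (1 - specificity, sensitivity) pairs, i.e. the roc curve."""
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    pos = sum(1 for l in labels if l == 'spam')
    neg = len(labels) - pos
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 'spam')
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 'ham')
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical spam-probability scores from some classifier.
scores = [0.95, 0.90, 0.60, 0.40, 0.20, 0.10]
labels = ['spam', 'spam', 'ham', 'spam', 'ham', 'ham']
for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)  # the curve runs from (0, 0) up to (1, 1)
```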

datasets: 6
4.3 Learning curves. Figure 3 shows the learning curves (spam and ham recall as more training messages are accumulated over time) of the multinomial nb with Boolean attributes on the six datasets for T = 0.5. It is interesting to observe that the curves do not increase monotonically, unlike most text classification experiments, presumably because of the unpredictable fluctuation of the ham-spam ratio, the changing topics of spam, and the adversarial nature of anti-spam filtering

datasets: 6
We emulated the situation faced by a new user of a personalized learning-based spam filter, adopting an incremental retraining and evaluation procedure. The six datasets that we used, and which we make publicly available, were created by mixing freely available ham and spam messages in different proportions. The mixing procedure emulates the unpredictable fluctuation over time of the ham-spam ratio in real mailboxes

References
  • [1] I. Androutsopoulos, J. Koutsias, K. Chandrinos, and C. Spyropoulos. An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In 23rd ACM SIGIR Conference, pages 160–167, Athens, Greece, 2000.
  • [2] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial e-mail. Technical report 2004/2, NCSR “Demokritos”, 2004.
  • [3] R. Bekkerman, A. McCallum, and G. Huang. Automatic categorization of email into folders: benchmark experiments on Enron and SRI corpora. Technical report IR-418, University of Massachusetts Amherst, 2004.
  • [4] X. Carreras and L. Marquez. Boosting trees for anti-spam email filtering. In 4th International Conference on Recent Advances in Natural Language Processing, pages 58–64, Tzigov Chark, Bulgaria, 2001.
  • [5] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3):103–130, 1997.
  • [6] H. Drucker, D. Wu, and V. Vapnik. Support Vector Machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.
  • [7] S. Eyheramendy, D. Lewis, and D. Madigan. On the Naive Bayes model for text categorization. In 9th International Workshop on Artificial Intelligence and Statistics, pages 332–339, Key West, Florida, 2003.
  • [9] S. Hershkop and S. Stolfo. Combining email models for false positive reduction. In 11th ACM SIGKDD Conference, pages 98–107, Chicago, Illinois, 2005.
  • [11] J. G. Hidalgo and M. M. Lopez. Combining text and heuristics for cost-sensitive spam filtering. In 4th Computational Natural Language Learning Workshop, pages 99–102, Lisbon, Portugal, 2000.
  • [14] G. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In 11th Conference on Uncertainty in Artificial Intelligence, pages 338–345, Montreal, Quebec, 1995.
  • [15] B. Klimt and Y. Yang. The Enron corpus: a new dataset for email classification research. In 15th European Conference on Machine Learning and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 217–226, Pisa, Italy, 2004.
  • [16] A. Kolcz and J. Alspector. SVM-based filtering of e-mail spam with content-specific misclassification costs. In Workshop on Text Mining, IEEE International Conference on Data Mining, San Jose, California, 2001.
  • [17] A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In AAAI’98 Workshop on Learning for Text Categorization, pages 41–48, Madison, Wisconsin, 1998.
  • [18] E. Michelakis, I. Androutsopoulos, G. Paliouras, G. Sakkis, and P. Stamatopoulos. Filtron: a learning-based anti-spam filter. In 1st Conference on Email and Anti-Spam, Mountain View, CA, 2004.
  • [19] P. Pantel and D. Lin. SpamCop: a spam classification and organization program. In Learning for Text Categorization – Papers from the AAAI Workshop, pages 95–98, Madison, Wisconsin, 1998.
  • [20] F. Peng, D. Schuurmans, and S. Wang. Augmenting Naive Bayes classifiers with statistical language models. Information Retrieval, 7:317–345, 2004.
  • [21] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization – Papers from the AAAI Workshop, pages 55–62, Madison, Wisconsin, 1998.
  • [22] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos. Stacking classifiers for anti-spam filtering of e-mail. In Conference on Empirical Methods in Natural Language Processing, pages 44–50, Carnegie Mellon University, Pittsburgh, PA, 2001.
  • [23] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos. A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1):49–73, 2003.