Active learning for logistic regression: an evaluation
Machine Learning, 68(3) (2007): 235–265
Which active learning methods can we expect to yield good performance in learning binary and multi-category logistic regression classifiers? Addressing this question is a natural first step in providing robust solutions for active learning across a wide variety of exponential models including maximum entropy, generalized linear, log-linear...
- Procurement of labeled training data is the seminal step of training a supervised machine learning algorithm.
- A trend of the last ten years (Abe and Mamitsuka 1998; Banko and Brill 2001; Chen et al. 2006; Dagan and Engelson 1995; Hwa 2004; Lewis and Gale 1994; McCallum and Nigam 1998; Melville and Mooney 2004; Roy and McCallum 2001; Tang et al. 2002) has been to employ heuristic methods of active learning with no explicitly defined objective function.
- A subtrend in the field has sought to improve performance of heuristics by combining them with secondary heuristics such as: similarity weighting (McCallum and Nigam 1998), interleaving active learning with EM (McCallum and Nigam 1998), interleaving active learning with co-training (Steedman et al. 2003), and sampling from clusters (Tang et al. 2002), among others.
- Focus soon turned to methods applicable to pool-based active learning, including the query by committee method (Seung et al. 1992) and experimental design methods based on A-optimality (Cohn 1996).
- The evaluations indicate that experimental design active learning of logistic regression is one of the more robust strategies available
- Future work in active learning using logistic regression will benefit from evaluating against these gold standard methods
- Throughout the active learning literature, we found statements to the effect that these methods are too computationally expensive to evaluate, but our results demonstrate that experimental design approaches are tractable for many data sets
- The experimental design methods have the disadvantage of memory and computational complexity, and we were unable to evaluate them on two of the larger document classification tasks
- The evaluations in this study have specific goals: to discover which methods work well, and to understand why methods perform badly when they do.
- Towards this end, the authors assembled a suite of machine learning data sets spanning a diverse range of predictor counts, category counts, and domains.
- Table 4 contains the results of hypothesis tests on mean stopping-point accuracy, comparing the alternatives to random sampling. Table 5 presents the same experiments in terms of the percentage of random examples needed to match each method's stopping-point accuracy.
- The experimental design methods produced attractive results much of the time without ever performing worse than random sampling.
- This can be seen by the hypothesis testing results and the deficiency measurements in Table 6.
- The result is so surprising that a separate section (5.5) is included to explore whether negative heuristic performance is an artifact of an "unlucky" evaluation design.
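- Many of the heuristics compared in the evaluation build on uncertainty sampling (Lewis and Gale 1994): query the pool example whose predicted label the current model is least certain about. A minimal illustrative sketch, assuming predicted class-probability vectors for the pool are already available; the function names `entropy` and `select_query` are this sketch's inventions, not the paper's:

```python
import math

def entropy(probs):
    """Shannon entropy of one example's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_query(pool_probs, batch=1):
    """Return the indices of the `batch` pool examples the model is
    least certain about (highest predictive entropy)."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:batch]

# Given three pool examples, the 50/50 prediction is queried first:
# select_query([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]) -> [1]
```

  For binary logistic regression, ranking by entropy is equivalent to querying the example whose predicted probability lies closest to 0.5, i.e. the example nearest the decision boundary.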
- Table 1: Notation used in the decomposition of squared error
- Table 2: Descriptions of the data sets used in the evaluation. Included are counts of: the number of categories (Classes), the number of observations (Obs), the test set size after splitting the data set into pool/test sets (Test), the number of predictors (Pred), the number of observations in the majority category (Maj), and the training set stopping point for the evaluation (Stop)
- Table 3: Average accuracy and squared error ((18), left-hand side) results for the tested data sets when the entire pool is used as the training set. The data sets are sorted by squared error as detailed in Sect. 5.4
- Table 4: Results of hypothesis tests comparing bagging and seven active learning method accuracies to random sampling at the final training set size. '+' indicates statistically significant improvement and '–' indicates statistically significant deterioration. 'NA' indicates 'not applicable'. Figures 2–5 display the actual results used for hypothesis testing as box plots
- Table 5: Results comparing random sampling, bagging, and seven active learning methods, reported as the percentage of random examples over (or under) the final training set size needed to give similar accuracies. Active learning methods were seeded with 20 random examples and stopped when training set sizes reached the final tested size (300 observations, with exceptions; see Sect. 5.3 for the rationale for different stopping points)
- Table 6: Average deficiency (see (47)) achieved by the various methods. For each data set the winner appears in boldface, marked with a star; the runner-up appears in boldface
- Table 7: The average percentage of matching test set margins when comparing models trained on data sets of size 300 to a model trained on the entire pool. Margins match if they are formed from the same pair of categories. Ten repetitions of the experiment produce the averages below
- Table 8: Results of hypothesis tests comparing four heuristic active learning method accuracies to random sampling at the final training set size. These active learners used the larger candidate size of 300. '+' indicates statistically significant improvement and '–' indicates statistically significant deterioration compared to random sampling. 'NA' indicates 'not applicable'
- Table 9: Average deficiency (see (47)) achieved by the various methods using the larger candidate size of 300. For each data set the winner appears in boldface, marked with a star; the runner-up appears in boldface
- Table 10: Results of hypothesis tests comparing bagging and four active learning method accuracies to random sampling at training set size 600. '+' indicates statistically significant improvement and '–' indicates statistically significant deterioration. 'NA' indicates 'not applicable'
- Table 11: Average deficiency (see (47)) achieved by the various methods beginning at 300 observations and ending at 600. For each data set the winner appears in boldface, marked with a star; the runner-up appears in boldface
- Table 12: Results of hypothesis tests comparing bagging and two query by bagging methods using a bag size of 15. '+' indicates statistically significant improvement and '–' indicates statistically significant deterioration. 'NA' indicates 'not applicable'
- Table 13: Average deficiency (see (47)) achieved by bagging and the two query by bagging methods using bag size 15. For each data set the winner appears in boldface, marked with a star; the runner-up appears in boldface
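- Several of the tables above report average deficiency per the paper's equation (47), which is not reproduced on this page. As an illustration only, here is one standard formulation of deficiency (in the style of Baram et al. 2003): the area an active learner's accuracy curve leaves below a reference accuracy, divided by the same area for random sampling. The function name and the use of the best attained accuracy as the reference level are this sketch's assumptions:

```python
def deficiency(active_curve, random_curve):
    """Deficiency of an active learner relative to random sampling.

    Both arguments are accuracy values recorded at the same sequence of
    training set sizes. Values below 1 favor the active learner; the
    sketch assumes random sampling does not already sit at the ceiling
    (nonzero denominator).
    """
    ceiling = max(max(active_curve), max(random_curve))
    active_gap = sum(ceiling - a for a in active_curve)
    random_gap = sum(ceiling - r for r in random_curve)
    return active_gap / random_gap
```

  For example, an active learner whose accuracies are uniformly closer to the ceiling than random sampling's yields a deficiency below 1, matching the interpretation of the boldfaced winners in the deficiency tables.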
- Andrew Schein was supported by NSF grant ITR-0205448.
- Abe, N., & Mamitsuka, H. (1998). Query learning strategies using boosting and bagging. In Proceedings of the 15th international conference on machine learning (ICML1998) (pp. 1–10).
- Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75, 87–106.
- Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual ACL meeting (ACL2001).
- Baum, E. B. (1991). Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2(1).
- Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71.
- Bickel, P. J., & Doksum, K. A. (2001). Mathematical statistics (2nd ed., Vol. 1). Englewood Cliffs: Prentice Hall.
- Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.
- Baram, Y., El-Yaniv, R., & Luz, K. (2003). Online choice of active learning algorithms. In Twentieth international conference on machine learning (ICML-2003).
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
- Buja, A., Stuetzle, W., & Shen, Y. (2005). Degrees of boosting: a study of loss functions for classification and class probability estimation. Working paper.
- Chaloner, K., & Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21, 191–208.
- Chen, J., Schein, A. I., Ungar, L. H., & Palmer, M. S. (2006). An empirical study of the behavior of active learning for word sense disambiguation. In Proceedings of the 2006 human language technology conference—North American chapter of the association for computational linguistics annual meeting HLT-NAACL 2006.
- Cohn, D. A. (1996). Neural network exploration using optimal experimental design. Neural Networks, 9(6), 1071–1083.
- Cohn, D. A. (1997). Minimizing statistical bias with queries. In Advances in neural information processing systems 9. Cambridge: MIT Press.
- Craven, M., DiPasquo, D., Freitag, D., McCallum, A. K., Mitchell, T. M., & Nigam, K. et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1/2), 69–113.
- Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. In International conference on machine learning (pp. 150–157).
- Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43, 1470–1480.
- Davis, R., & Prieditis, A. (1999). Designing optimal sequential experiments for a Bayesian classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(3).
- Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133–168.
- Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2).
- Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., & Dahlgren, N. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST.
- Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
- Gilad-Bachrach, R., Navot, A., & Tishby, N. (2003). Kernel query by committee (KQBC) (Tech. Rep. No. 2003-88). Leibniz Center, the Hebrew University.
- Hosmer, D. E., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
- Hwa, R. (2004). Sample selection for statistical parsing. Computational Linguistics, 30(3).
- Hwang, J.-N., Choi, J. J., Oh, S., & Marks, R. J. (1991). Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks, 2(1).
- Jin, R., Yan, R., Zhang, J., & Hauptmann, A. G. (2003). A faster iterative scaling algorithm for conditional exponential model. In Proceedings of the twentieth international conference on machine learning (ICML-2003), Washington, DC.
- Kaynak, C. (1995). Methods of combining multiple classifiers and their applications to handwritten digit recognition. Unpublished master's thesis, Bogazici University.
- Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning (pp. 282–289). Los Altos: Kaufmann.
- Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In W.B. Croft & C.J. van Rijsbergen (Eds.), Proceedings of SIGIR-94, 17th ACM international conference on research and development in information retrieval (pp. 3–12), Dublin. Heidelberg: Springer.
- MacKay, D. J. C. (1991). Bayesian methods for adaptive models. Unpublished doctoral dissertation, California Institute of Technology.
- MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation, 4(5), 698–714.
- Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the sixth conference on natural language learning (CoNLL-2002).
- McCallum, A., & Nigam, K. (1998). Employing em in pool-based active learning for text classification. In Proceedings of the 15th international conference on machine learning (ICML1998).
- McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Boca Raton: CRC Press.
- Melville, P., & Mooney, R. (2004). Diverse ensembles for active learning. In Proceedings of the 21st international conference on machine learning (ICML-2004) (pp. 584–591).
- Mitchell, T. M. (1997). Machine learning. New York: McGraw–Hill.
- Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering.
- Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Berlin: Springer.
- Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th international conference on machine learning (pp. 441–448). San Francisco: Kaufmann.
- Saar-Tsechansky, M., & Provost, F. (2001). Active learning for class probability estimation and ranking. In Proceedings of the international joint conference on artificial intelligence (pp. 911–920).
- Schein, A. I. (2005). Active learning for logistic regression. Dissertation in Computer and Information Science, The University of Pennsylvania.
- Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Computational learning theory (pp. 287–294).
- Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., & Hockenmaier, J. (2003). Example selection for bootstrapping statistical parsers. In Proceedings of the annual meeting of the North American chapter of the ACL, Edmonton, Canada.
- Tang, M., Luo, X., & Roukos, S. (2002). Active learning for statistical natural language parsing. In ACL 2002.
- Zheng, Z., & Padmanabhan, B. (2006). Selectively acquiring customer information: A new data acquisition problem and an active learning-based solution. Management Science, 52(5), 697–712.