Class-Weighted Classification: Trade-offs and Robust Approaches

Xu Ziyu, Khim Justin

ICML, pp. 10544-10554, 2020.


Abstract:

We address imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. First, we examine the convergence rates of the expected excess weighted risk of plug-in classifiers where the weighting for the plug-in classifier and the risk...
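
The abstract's core device is weighting each example's loss by its true class. The minimal Python sketch below shows one way to realize that weighted empirical risk; the function name, the inverse-frequency weights, and the toy data are illustrative assumptions, not the paper's notation.

    import numpy as np

    def weighted_risk(per_example_loss, labels, class_weights):
        # Scale each example's loss by the weight of its true class, so
        # rare but important classes can be upweighted in the objective.
        w = np.array([class_weights[int(y)] for y in labels])
        return float(np.mean(w * per_example_loss))

    # Toy usage: class 1 is rare (5%), so it gets an inverse-frequency weight.
    labels = np.array([0] * 95 + [1] * 5)
    loss = np.concatenate([np.full(95, 0.1), np.full(5, 0.9)])
    print(weighted_risk(loss, labels, {0: 1.0, 1: 95 / 5}))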

Introduction
  • Classification is a fundamental problem in statistics and machine learning, spanning scientific problems such as cancer diagnosis and satellite image processing as well as engineering applications such as credit card fraud detection, handwritten digit recognition, and text processing (Khan et al., 2001; Lee et al., 2004), but modern applications have brought new challenges.
  • Image data exhibit a long tail: many classes with few examples (Salakhutdinov et al., 2011; Zhu et al., 2014).
  • In such settings, the classes with smaller probabilities are generally classified incorrectly more often, and this is undesirable when the smaller classes are important, such as rare forms of cancer, fraudulent credit card transactions, and expensive online purchases.
  • Thus, the authors argue, modern classification methods are needed that work well when there are a large number of classes and when the class-wise probabilities are imbalanced.
Highlights
  • Classification is a fundamental problem in statistics and machine learning, spanning scientific problems such as cancer diagnosis and satellite image processing as well as engineering applications such as credit card fraud detection, handwritten digit recognition, and text processing (Khan et al., 2001; Lee et al., 2004), but modern applications have brought new challenges.
  • We examine the empirical performance of the label conditional value at risk (LCVaR) and label heterogeneous conditional value at risk (LHCVaR) objectives, and compare them against the standard risk and a balanced risk as baselines.
  • Note that the more significant the imbalance, i.e., the smaller the minority class probability, the better LCVaR and LHCVaR perform compared to the balanced risk on class 0, while paying a progressively smaller price on the class 1 risk.
  • We note that while the worst class risk of LCVaR and LHCVaR seems to decrease with greater imbalance, this may not be a general property of these methods.
  • We subsequently show that optimizing with respect to LCVaR and LHCVaR empirically improves the worst class risk, at a reasonable cost to accuracy.
  • If each prior over the classes is formalized as a weighting, optimizing LCVaR or LHCVaR may improve performance when the test class priors differ from the training class priors.
Methods
  • [Table fragment: standard risk and worst class risk for LCVaR and LHCVaR at parameter values 0.01, 0.05, and 0.1; the numeric entries did not survive extraction. See Tables 1 and 2.]
  • Improving the worst class risk comes at a cost to the standard risk for both LCVaR and LHCVaR (a minimal sketch of the underlying objective follows this list).
  • This tradeoff is reflected in the histograms of class risk shown in Fig. 4, where the class risks under the standard and balanced classifiers are more spread out and have classes with much lower risks.
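
To make the objective behind these numbers concrete, here is a minimal sketch of a CVaR-over-classes risk in the spirit of LCVaR, written with the standard Rockafellar-Uryasev variational form of CVaR (Rockafellar and Uryasev, 2000); the paper's exact estimator and notation may differ.

    import numpy as np

    def label_cvar(per_example_loss, labels, class_probs, alpha):
        # Class-conditional empirical risks R_k.
        classes = np.unique(labels)
        risks = np.array([per_example_loss[labels == k].mean() for k in classes])
        probs = np.array([class_probs[int(k)] for k in classes])

        # Rockafellar-Uryasev form of CVaR at level alpha over class risks:
        #   CVaR_alpha = min_lam  lam + (1/alpha) * sum_k p_k * max(R_k - lam, 0).
        def objective(lam):
            return lam + np.sum(probs * np.maximum(risks - lam, 0.0)) / alpha

        # The objective is convex and piecewise linear in lam, so for
        # alpha in (0, 1] its minimum is attained at one of the R_k.
        return float(min(objective(lam) for lam in risks))

    # Toy usage: class 1 is rare and has a much higher class risk.
    rng = np.random.default_rng(0)
    labels = (rng.random(1000) < 0.05).astype(int)
    loss = np.where(labels == 1, 0.8, 0.1) + 0.05 * rng.random(1000)
    print(label_cvar(loss, labels, {0: 0.95, 1: 0.05}, alpha=0.1))

Taking alpha = 1 recovers the standard prior-weighted risk, while smaller alpha concentrates the objective on the worst classes, which matches the trade-off reported above.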
Results
  • 5.1 Methods

    The authors examine the empirical performance of LCVaR and LHCVaR risks, and compare them against the standard risk and a balanced risk as baselines.
  • The authors note that while the worst class risk of LCVaR and LHCVaR seems to decrease with greater imbalance, this may not be a general property of these methods.
  • The main observation is that LCVaR and LHCVaR have lower worst class risk in comparison to the baseline methods.
  • This empirically demonstrates that both LCVaR and LHCVaR can significantly improve the highest class risks while losing little in performance on classes with lower risks.
Conclusion
  • The authors have studied the effect of optimizing classifiers with respect to different weightings and developed robust risk measures that minimize the worst-case weighted risk across a set of weightings.
  • The authors subsequently show that optimizing with respect to LCVaR and LHCVaR empirically improves the worst class risk, at a reasonable cost to accuracy.
  • One future direction for research is to understand the Bayes optimal classifier under LCVaR and LHCVaR.
  • Another more applied direction could be to consider domain shift.
  • If each prior over the classes is formalized as a weighting, optimizing LCVaR or LHCVaR may improve performance when the test class priors differ from the training class priors (a minimal reweighting sketch follows this list).
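
As a hedged illustration of that last point: under label shift, where the class-conditional distributions stay fixed but the class priors move, the natural weighting is the ratio of test to training priors, since the weighted training risk then equals the test risk. The helper below is our sketch, not the paper's method.

    def prior_shift_weights(train_priors, test_priors):
        # w_k = p_test(k) / p_train(k): with these weights,
        # E_train[w_Y * loss] = E_test[loss] under label shift.
        return {k: test_priors[k] / train_priors[k] for k in train_priors}

    # Toy usage: class 1 is rare in training but common at test time.
    print(prior_shift_weights({0: 0.95, 1: 0.05}, {0: 0.6, 1: 0.4}))
    # -> {0: 0.631..., 1: 8.0}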
Tables
  • Table 1: Standard risk and risk of the worst class for each method on the Covertype dataset. LCVaR and LHCVaR improve on the worst class risk.
  • Table 2: Performance of LCVaR and LHCVaR across different values of their respective parameters. Each method's performance is relatively insensitive to the parameter choice, although the smallest parameter values for each method produce the largest changes in worst class risk.
Related work
  • We briefly review other research related to imbalanced classification, but for a far more exhaustive treatment, see a survey of the area (He and Garcia, 2009; Fernández et al, 2018). First, two other methods may be employed to solve imbalanced classification problems. The first is class-based margin adjustment (Lin et al, 2002; Scott, 2012; Cao et al, 2019), in which the margin parameter for the margin loss function may vary by class. Broadly, margin adjustment and weighting may both be considered loss modification procedures. The second method is Neyman-Pearson classification, in which one attempts to minimize the error on one class given a constraint on the worst permissible error on the other class (Rigollet and Tong, 2011; Tong, 2013; Tong et al, 2016).

    An important topic related to our paper but that has not been well-connected to imbalanced classification is robust optimization. Robust optimization is a well-studied topic (Ben-Tal and Nemirovski, 1999, 2003; Ben-Tal et al, 2004, 2009). A variant that has gained traction more recently is distributionally robust optimization (Ben-Tal et al, 2013; Bertsimas et al, 2014; Namkoong and Duchi, 2017). Unsurprisingly, CVaR, as a coherent risk measure, has been previously connected to distributionally robust optimization (Goh and Sim, 2010). Distributionally robust optimization generally and CVaR specifically have also previously been used in machine learning to deal with imbalance (Duchi et al, 2018; Duchi and Namkoong, 2018), but in these works, the imbalance was considered to exist in the covariates, whether known to the algorithm or not. These are motivated by the recent push toward fairness in machine learning, in particular so that ethnic minorities do not suffer discrimination in high-stakes situations such as loan applications, medical diagnoses, or parole decisions, due to biases in the data.
Reference
  • J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
  • P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
  • A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1–13, 1999.
  • A. Ben-Tal and A. Nemirovski. Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, 88(3):411–424, 2003.
  • A. Ben-Tal, A. Goryashko, E. Guslitzer, and A. Nemirovski. Adjustable robust solutions of uncertain linear programs. Mathematical Programming, 99(2):351–376, 2004.
  • A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
  • A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • D. Bertsimas, V. Gupta, and N. Kallus. Robust sample average approximation. Mathematical Programming, pages 1–66, 2014.
  • K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019.
  • K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Science & Business Media, 1996.
  • P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In KDD, volume 99, pages 155–164, 1999.
  • D. Dua and C. Graff. UCI Machine Learning Repository, 2017.
  • J. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • J. C. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. arXiv preprint, 2018.
  • V. Feldman. Does learning require memorization? A short tale about a long tail. arXiv preprint arXiv:1906.05271, 2019.
  • A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera. Learning from Imbalanced Data Sets. Springer, 2018.
  • J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4):902–917, 2010.
  • N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory, pages 297–299, 2018.
  • L. Györfi. The rate of convergence of k_n-NN regression estimates and classification rule. IEEE Transactions on Information Theory, 27(3):357–362, 1981.
  • E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
  • E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673, 2001.
  • O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems 27, pages 2744–2752. Curran Associates, Inc., 2014.
  • A. Krzyzak and M. Pawlak. The pointwise rate of convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 16:159–166, 1987.
  • V. Kuznetsov, M. Mohri, and U. Syed. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.
  • Y. Lee, G. Wahba, and S. A. Ackerman. Cloud classification of satellite radiance data by multicategory support vector machines. Journal of Atmospheric and Oceanic Technology, 21(2):159–169, 2004.
  • Y. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1-3):191–202, 2002.
  • Y.-C. Lin, P. Das, and A. Datta. Overview of the SIGIR 2018 eCom Rakuten Data Challenge. In eCOM@SIGIR, 2018.
  • G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655, 2018.
  • A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In International Conference on Machine Learning, pages 603–611, 2013.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
  • H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493–1501, 2014.
  • P. Rigollet and X. Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12:2831–2855, 2011.
  • R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
  • R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488. IEEE, 2011.
  • C. Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.
  • A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.
  • C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.
  • X. Tong. A plug-in approach to Neyman-Pearson classification. The Journal of Machine Learning Research, 14(1):3011–3040, 2013.
  • X. Tong, Y. Feng, and A. Zhao. A survey on Neyman-Pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics, 8(2):64–81, 2016.
  • C. J. Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.
  • C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 2nd edition, 1979.
  • J. Wang, X. Shen, and Y. Liu. Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167, 2008.
  • X. Wang, H. Helen Zhang, and Y. Wu. Multiclass probability estimation with support vector machines. Journal of Computational and Graphical Statistics, pages 1–18, 2019.
  • X. Wang, Y. Tsvetkov, and G. Neubig. Balancing training for multilingual neural machine translation. arXiv preprint arXiv:2004.06748, 2020.
  • Y. Wu, H. H. Zhang, and Y. Liu. Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105(489):424–436, 2010.
  • Y. Yang. Minimax nonparametric classification. Part I: Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.
  • Z.-H. Zhou and X.-Y. Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, 2006.
  • X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2014.
  • G. K. Zipf. The Psycho-Biology of Language: An Introduction to Dynamic Philology. George Routledge & Sons, Ltd., 1936.
Appendix Excerpts
  • For simplicity, we assume that our density estimator is a local polynomial estimator (Stone, 1982), but the properties that the estimator must have for the following proofs to succeed can also be satisfied by other nonparametric estimators, such as kernelized regression (Krzyzak and Pawlak, 1987) and nearest-neighbors regression (Györfi, 1981).
  • Now, we turn to Proposition 1, Proposition 2, and Proposition 3. Our proofs rely on the following lemma of Yang (1999). First, we introduce a few additional definitions. Denote the ε-entropy of Σ with respect to the ‖·‖_q norm for 1 ≤ q ≤ ∞ by H(ε, Σ, ‖·‖_q). We define the norm
  • Lemma 2 (Theorem 1 of Yang 1999). Let f be an element of Σ, where Σ is a class of functions from ℝ^d to [0, 1]. Suppose the ε-entropy satisfies
  • Subsequent works (Audibert and Tsybakov, 2007; Chaudhuri and Dasgupta, 2014) leverage this assumption to provide fast, explicit rates of convergence for the expected risk. The margin condition is naturally suited to standard plug-in classification because the decision threshold is 1/2; for weighted plug-in classification, we need a shifted margin condition (a sketch of the shifted threshold follows this item).
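
Concretely, in the binary case with weights w_0 and w_1 on the two class-conditional errors (our notation for illustration, not necessarily the paper's), the weighted Bayes rule and the corresponding shifted margin condition read:

    % Predicting 1 costs w_0 (1 - \eta(x)) in expectation, predicting 0
    % costs w_1 \eta(x), so the weighted Bayes rule predicts 1 when the
    % latter dominates, i.e. at a shifted threshold:
    \[
      f_w^*(x) = \mathbf{1}\{\eta(x) \ge t_w\},
      \qquad
      t_w = \frac{w_0}{w_0 + w_1},
    \]
    % which recovers the threshold 1/2 when w_0 = w_1. The shifted margin
    % condition then bounds the probability mass near t_w:
    \[
      \mathbb{P}\bigl(\lvert \eta(X) - t_w \rvert \le \epsilon\bigr) \le C\,\epsilon^{\beta}.
    \]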
  • Before proving this proposition, we prove a helpful lemma that leverages the shifted margin condition, similar to one from Audibert and Tsybakov (2007).
  • Lemma 4 (Theorem 1 of Stone 1982). Let η̂ be a local polynomial regression estimator, and suppose X has a density that is lower bounded by some constant p_min > 0 on its support. Then, we have the following upper bound:
  • The above bound is the optimal rate of uniform convergence for nonparametric estimators under the regularity conditions shown here, and local polynomial regression achieves this optimal rate (Stone, 1982).
  • Since we may be interested in performance in error metrics other than risk, we discuss other classification metrics here. In particular, we simply show that weighting is “universal” in that it can be used to optimize these other classification metrics. The reason for this is that, in plug-in classification, optimizing many classification metrics is equivalent to altering the threshold for the classification, and this has been observed to lead to the optimal decision rule in many cases (Lewis, 1995; Menon et al., 2013; Narasimhan et al., 2014; Koyejo et al., 2014). We examine the specific case of metrics considered in Koyejo et al. (2014).
  • Koyejo et al. (2014) showed that the optimal classifier for any linear-fractional metric is simply a threshold classifier. Specifically, the following theorem is true.
  • Theorem 2 (Koyejo et al. 2014). Let L be a linear-fractional metric, and let ℙ be absolutely continuous with respect to the dominating measure on the sample space. Define
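
The theorem statement above is truncated in this scrape. As an illustration of its thresholding message, here is a minimal plug-in sketch that searches thresholds on estimated conditional probabilities for a linear-fractional metric, using F1 as the example; the function names and synthetic data are ours.

    import numpy as np

    def best_threshold(prob1, labels, metric):
        # For a linear-fractional metric, the optimal classifier thresholds
        # the estimated P(Y = 1 | X), so a 1-D search over candidate
        # thresholds suffices. `metric` maps (tp, fp, fn, tn) to a score.
        best_t, best_score = 0.5, -np.inf
        for t in np.unique(prob1):
            pred = (prob1 >= t).astype(int)
            tp = int(np.sum((pred == 1) & (labels == 1)))
            fp = int(np.sum((pred == 1) & (labels == 0)))
            fn = int(np.sum((pred == 0) & (labels == 1)))
            tn = int(np.sum((pred == 0) & (labels == 0)))
            score = metric(tp, fp, fn, tn)
            if score > best_score:
                best_t, best_score = t, score
        return best_t

    def f1(tp, fp, fn, tn):
        # F1 is linear-fractional in the confusion-matrix entries.
        return 2 * tp / max(2 * tp + fp + fn, 1)

    # Toy usage with synthetic probabilities that separate the classes.
    rng = np.random.default_rng(0)
    labels = (rng.random(500) < 0.1).astype(int)
    prob1 = np.clip(0.1 + 0.6 * labels + 0.2 * rng.standard_normal(500), 0.0, 1.0)
    print(best_threshold(prob1, labels, f1))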