# Class-Weighted Classification: Trade-offs and Robust Approaches

ICML, pp. 10544-10554, 2020.

Abstract:

We address imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. First, we examine the convergence rates of the expected excess weighted risk of plug-in classifiers where the weighting for the plug-in classifier and the risk…
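The core device described in the abstract, scaling each example's loss by a weight attached to its true class, can be sketched in a few lines. This is a toy illustration with a hypothetical `weighted_cross_entropy` helper, not the authors' code:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Average cross-entropy where each example's loss is scaled by
    the weight of its true class (illustrative helper, not the
    paper's implementation)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    w = np.asarray(class_weights, dtype=float)[labels]    # weight per example
    nll = -np.log(probs[np.arange(len(labels)), labels])  # per-example loss
    return float(np.mean(w * nll))

# Two classes; upweighting class 1 penalizes its mistakes more heavily.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
print(weighted_cross_entropy(probs, labels, [1.0, 1.0]))  # unweighted baseline
print(weighted_cross_entropy(probs, labels, [1.0, 5.0]))  # class 1 upweighted
```

Choosing larger weights for rare classes is what trades standard risk for worst-class risk throughout the paper.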

Introduction

- Classification is a fundamental problem in statistics and machine learning, spanning scientific problems such as cancer diagnosis and satellite image processing as well as engineering applications such as credit card fraud detection, handwritten digit recognition, and text processing (Khan et al., 2001; Lee et al., 2004), but modern applications have brought new challenges.
- Image data exhibit a long tail: many classes have only a few examples (Salakhutdinov et al., 2011; Zhu et al., 2014).
- In such settings, the classes with smaller probabilities are generally misclassified more often, which is undesirable when the smaller classes are important, such as rare forms of cancer, fraudulent credit card transactions, and expensive online purchases.
- The authors need modern classification methods that work well when there are a large number of classes and when the class-wise probabilities are imbalanced.

Highlights

- We examine the empirical performance of the label conditional value at risk (LCVaR) and label heterogeneous conditional value at risk (LHCVaR) objectives, and compare them against the standard risk and a balanced risk as baselines.
- The more significant the imbalance, the better LCVaR and LHCVaR perform compared to the balanced risk on class 0, while paying a progressively smaller price on the class 1 risk.
- While the worst class risk of LCVaR and LHCVaR seems to decrease with greater imbalance, this may not be a general property of these methods.
- We subsequently show that optimizing with respect to LCVaR and LHCVaR empirically improves the worst class risk, at a reasonable cost to accuracy.
- If each prior over the classes is formalized as a weighting, optimizing LCVaR or LHCVaR may improve performance when the test class priors differ from the training class priors.
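The LCVaR objective can be read as a conditional value at risk taken over the per-class risks. The following numerical sketch uses the Rockafellar and Uryasev (2000) variational form, with made-up class risks and marginals; `alpha` is the CVaR level, and this is an illustration rather than the paper's optimizer:

```python
import numpy as np

def label_cvar(class_risks, class_probs, alpha):
    """CVaR at level alpha of the per-class risks, via the
    Rockafellar-Uryasev formulation:
        min over lam of  lam + (1/alpha) * sum_y p_y * max(r_y - lam, 0).
    The objective is piecewise linear and convex in lam with breakpoints
    at the class risks, so a search over those breakpoints suffices."""
    r = np.asarray(class_risks, dtype=float)
    p = np.asarray(class_probs, dtype=float)
    values = [lam + np.sum(p * np.maximum(r - lam, 0.0)) / alpha for lam in r]
    return float(min(values))

risks = [0.05, 0.10, 0.60]  # toy per-class risks; class 2 is hardest
probs = [0.70, 0.25, 0.05]  # toy class marginals; class 2 is rare
print(label_cvar(risks, probs, alpha=1.0))   # alpha = 1 recovers the standard risk
print(label_cvar(risks, probs, alpha=0.05))  # small alpha focuses on the worst class
```

Interpolating `alpha` between these extremes is what lets LCVaR trade off average and worst-class performance.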

Methods

- Table 1 reports the standard risk and the worst class risk for each method, with LCVaR and LHCVaR evaluated at several parameter values.
- Improving the worst class risk comes at a cost to the standard risk for both LCVaR and LHCVaR.
- This tradeoff is reflected in the histograms of class risk shown in Fig. 4, where the class risks under the standard and balanced classifiers are more spread out and have classes with much lower risks.

Results

5.1 Methods

- The authors examine the empirical performance of the LCVaR and LHCVaR objectives, and compare them against the standard risk and a balanced risk as baselines.
- The authors note that while the worst class risk of LCVaR and LHCVaR seems to decrease with greater imbalance, this may not be a general property of these methods.
- The main observation is that LCVaR and LHCVaR have lower worst class risk in comparison to the baseline methods
- This empirically demonstrates that both LCVaR and LHCVaR can significantly improve the highest class risks while losing little in performance on classes with lower risks

Conclusion

- The authors have studied the effect of optimizing classifiers with respect to different weightings and developed robust risk measures that minimize the worst-case weighted risk across a set of weightings.
- The authors subsequently show that optimizing with respect to LCVaR and LHCVaR empirically improves the worst class risk, at a reasonable cost to accuracy.
- One future direction for research is to understand the Bayes optimal classifier under LCVaR and LHCVaR.
- Another more applied direction could be to consider domain shift.
- If the authors formalize each prior over the classes as a weighting, optimizing LCVaR or LHCVaR may improve performance when the test class priors differ from the training class priors.
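The prior-shift idea in the last bullet can be illustrated with the standard correction of reweighting estimated posteriors by the ratio of test to training priors before predicting. This is a sketch under the assumption that both priors are known, not code from the paper:

```python
import numpy as np

def prior_shift_predict(cond_probs, train_priors, test_priors):
    """Reweight estimated class posteriors eta_y(x) by q_y / p_y
    (test prior over training prior) and predict the argmax.
    A standard prior-shift correction sketch."""
    eta = np.asarray(cond_probs, dtype=float)  # shape (n_examples, n_classes)
    w = np.asarray(test_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    return np.argmax(eta * w, axis=1)

eta = np.array([[0.6, 0.4]])
# Matching priors: the raw posterior wins, so class 0 is predicted.
print(prior_shift_predict(eta, [0.5, 0.5], [0.5, 0.5]))
# Class 1 is far more common at test time than in training: the decision flips.
print(prior_shift_predict(eta, [0.9, 0.1], [0.5, 0.5]))
```

Each candidate test prior corresponds to one weighting, which is why a robust objective over a set of weightings can hedge against this kind of shift.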


- Table1: Standard risk and risk of the worst class for each method on the Covertype dataset. LCVaR and LHCVaR improve on the worst class risk
- Table2: Performance of LCVaR and LHCVaR across different hyperparameter values. Each method's performance is relatively insensitive to these choices, although the smallest parameter values yield the largest changes in worst class risk

Related work

- We briefly review other research related to imbalanced classification; for a far more exhaustive treatment, see the surveys of the area (He and Garcia, 2009; Fernández et al., 2018). Two other methods may be employed to solve imbalanced classification problems. The first is class-based margin adjustment (Lin et al., 2002; Scott, 2012; Cao et al., 2019), in which the margin parameter of the margin loss function may vary by class. Broadly, margin adjustment and weighting may both be considered loss-modification procedures. The second is Neyman-Pearson classification, in which one attempts to minimize the error on one class given a constraint on the worst permissible error on the other class (Rigollet and Tong, 2011; Tong, 2013; Tong et al., 2016).

An important topic related to our paper, but one that has not been well connected to imbalanced classification, is robust optimization. Robust optimization is a well-studied topic (Ben-Tal and Nemirovski, 1999, 2003; Ben-Tal et al., 2004, 2009). A variant that has gained traction more recently is distributionally robust optimization (Ben-Tal et al., 2013; Bertsimas et al., 2014; Namkoong and Duchi, 2017). Unsurprisingly, CVaR, as a coherent risk measure, has previously been connected to distributionally robust optimization (Goh and Sim, 2010). Distributionally robust optimization generally, and CVaR specifically, have also previously been used in machine learning to deal with imbalance (Duchi et al., 2018; Duchi and Namkoong, 2018), but in these works the imbalance was considered to exist in the covariates, whether known to the algorithm or not. These works are motivated by the recent push toward fairness in machine learning, in particular so that ethnic minorities do not suffer discrimination in high-stakes situations such as loan applications, medical diagnoses, or parole decisions, due to biases in the data.

References

- J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
- P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
- A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations research letters, 25(1):1–13, 1999.
- A. Ben-Tal and A. Nemirovski. Robust solutions of linear programming problems contaminated with uncertain data. Mathematical programming, 88(3):411–424, 2003.
- A. Ben-Tal, A. Goryashko, E. Guslitzer, and A. Nemirovski. Adjustable robust solutions of uncertain linear programs. Mathematical Programming, 99(2):351–376, 2004.
- A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
- A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
- D. Bertsimas, V. Gupta, and N. Kallus. Robust sample average approximation. Mathematical Programming, pages 1–66, 2014.
- K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019.
- K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
- L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer Science & Business Media, 1996.
- P. Domingos. Metacost: A general method for making classifiers cost-sensitive. In KDD, volume 99, pages 155–164, 1999.
- D. Dua and C. Graff. UCI Machine Learning Repository, 2017.
- J. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
- J. C. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. Arxiv, 2018.
- V. Feldman. Does learning require memorization? a short tale about a long tail. arXiv preprint arXiv:1906.05271, 2019.
- A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera. Learning from imbalanced data sets. Springer, 2018.
- J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations research, 58(4-part-1):902–917, 2010.
- N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299, 2018.
- L. Györfi. The Rate of Convergence of kn-NN Regression Estimates and Classification Rule. IEEE Transactions on Information Theory, 27(3):357–362, 1981. ISSN 0018-9448.
- E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
- E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6):673, 2001.
- O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent Binary Classification with Generalized Performance Metrics. In Advances in Neural Information Processing Systems 27, pages 2744–2752. Curran Associates, Inc., 2014.
- A. Krzyzak and M. Pawlak. The pointwise rate of convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 16:159–166, 1987.
- V. Kuznetsov, M. Mohri, and U. Syed. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.
- Y. Lee, G. Wahba, and S. A. Ackerman. Cloud classification of satellite radiance data by multicategory support vector machines. Journal of Atmospheric and Oceanic Technology, 21(2):159–169, 2004.
- Y. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in nonstandard situations. Machine learning, 46(1-3):191–202, 2002.
- Y.-C. Lin, P. Das, and A. Datta. Overview of the SIGIR 2018 eCom Rakuten Data Challenge. In eCOM@ SIGIR, 2018.
- G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi. Bagan: Data augmentation with balancing gan. arXiv preprint arXiv:1803.09655, 2018.
- A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In International Conference on Machine Learning, pages 603–611, 2013.
- M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
- H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
- H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493–1501, 2014.
- P. Rigollet and X. Tong. Neyman-pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831–2855, 2011.
- R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
- R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488. IEEE, 2011.
- C. Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.
- A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
- C. J. Stone. Optimal Global Rates of Convergence for Nonparametric Regression. The Annals of Statistics, 10(4):1040–1053, 1982.
- X. Tong. A plug-in approach to neyman-pearson classification. The Journal of Machine Learning Research, 14(1):3011–3040, 2013.
- X. Tong, Y. Feng, and A. Zhao. A survey on neyman-pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics, 8(2):64–81, 2016.
- C. J. Van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.
- C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, London, 2nd edition, 1979.
- J. Wang, X. Shen, and Y. Liu. Probability estimation for large-margin classifiers. Biometrika, 95 (1):149–167, 2008.
- X. Wang, H. Helen Zhang, and Y. Wu. Multiclass probability estimation with support vector machines. Journal of Computational and Graphical Statistics, pages 1–18, 2019.
- X. Wang, Y. Tsvetkov, and G. Neubig. Balancing training for multilingual neural machine translation. arXiv preprint arXiv:2004.06748, 2020.
- Y. Wu, H. H. Zhang, and Y. Liu. Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105(489):424–436, 2010.
- Y. Yang. Minimax nonparametric classification. i. rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.
- Z.-H. Zhou and X.-Y. Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, 2006.
- X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 915–922, 2014.
- G. K. Zipf. The Psycho-Biology of Language: an Introduction to Dynamic Philology. George Routledge & Sons, Ltd., 1936.
- For simplicity, we assume that our estimator is a local polynomial estimator (Stone, 1982), but the properties that the estimator must have for the following proofs to succeed can also be satisfied by other nonparametric estimators such as kernel regression (Krzyzak and Pawlak, 1987) and nearest-neighbors regression (Györfi, 1981).
- Now, we turn to Proposition 1, Proposition 2, and Proposition 3. Our proofs rely on the following lemma of Yang (1999). First, we introduce a few additional definitions. Denote the ε-entropy of Σ with respect to the L_q norm, for 1 ≤ q ≤ ∞, by H(ε, Σ, ‖·‖_q). We define the norm
- Lemma 2 (Theorem 1 of Yang 1999). Let f be an element of Σ, where Σ is a class of functions from ℝ^d to [0, 1]. Suppose the ε-entropy satisfies
- Subsequent works (Audibert and Tsybakov, 2007; Chaudhuri and Dasgupta, 2014) leverage this assumption to provide fast, explicit rates of convergence for expected risk. The margin condition is naturally suited to standard plug-in classification because the decision threshold is 1/2; for weighted plug-in classification, we need a shifted margin condition.
- Before proving this proposition, we prove a helpful lemma that leverages the shifted margin condition, similar to one from Audibert and Tsybakov (2007).
- Lemma 4 (Theorem 1 of Stone 1982). Let η̂ be a local polynomial regression estimator, and suppose X has a density that is lower bounded by some constant p_min > 0 on its support. Then, we have the following upper bound:
- The above bound is the optimal rate of uniform convergence for nonparametric estimators under the regularity conditions shown here, and local polynomial regression achieves this optimal rate (Stone, 1982).
- Since we may be interested in performance in error metrics other than risk, we discuss other classification metrics here. In particular, we simply show that weighting is “universal” in that it can be used to optimize these other classification metrics. The reason for this is that, in plug-in classification, optimizing many classification metrics is equivalent to altering the threshold for the classification, and this has been observed to lead to the optimal decision rule in many cases (Lewis, 1995; Menon et al., 2013; Narasimhan et al., 2014; Koyejo et al., 2014). We examine the specific case of metrics considered in Koyejo et al. (2014).
- Koyejo et al. (2014) showed that the optimal classifier for any linear-fractional metric is simply a threshold classifier. Specifically, the following theorem is true.
- Theorem 2 (Koyejo et al. 2014). Let L be a linear-fractional metric, and let the data distribution be absolutely continuous with respect to the dominating measure on the sample space. Define
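The thresholding idea above, that many classification metrics are optimized simply by shifting the plug-in decision threshold, can be sketched for the binary weighted case: weights (w0, w1) move the threshold from 1/2 to w0 / (w0 + w1). This is an illustrative sketch of the thresholding principle, not the paper's exact estimator:

```python
import numpy as np

def weighted_plugin_predict(eta, w0, w1):
    """Binary weighted plug-in rule: predict class 1 when
    w1 * eta(x) >= w0 * (1 - eta(x)), i.e. eta(x) >= w0 / (w0 + w1).
    With w0 = w1 this reduces to the usual 1/2 threshold; upweighting
    class 1 lowers the threshold in its favor."""
    eta = np.asarray(eta, dtype=float)   # estimated P(Y = 1 | X = x) per example
    threshold = w0 / (w0 + w1)
    return (eta >= threshold).astype(int)

eta = np.array([0.3, 0.5, 0.7])
print(weighted_plugin_predict(eta, 1.0, 1.0))  # threshold 0.5: standard plug-in
print(weighted_plugin_predict(eta, 1.0, 3.0))  # threshold 0.25: minority class favored
```

Sweeping the weights (equivalently, the threshold) is what makes weighting "universal" for the linear-fractional metrics discussed above.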
