# Uncertainty-aware Active Learning for Optimal Bayesian Classifier

International Conference on Learning Representations, 2021.

Abstract:

For pool-based active learning, in each iteration a candidate training sample is chosen for labeling by optimizing an acquisition function. In Bayesian classification, Expected Loss Reduction (ELR) methods maximize the expected reduction in the classification error given a new labeled candidate, based on a one-step-look-ahead strategy. ELR...


Introduction

- In supervised learning, labeling data is often expensive and highly time-consuming. Active learning is one field of research that aims to address this problem and has been shown to enable sample-efficient learning with fewer required labels (Gal et al., 2017; Tran et al., 2019; Sinha et al., 2019).
- By optimizing an acquisition function, an active learner chooses which candidate training sample to query for labeling and, based on the acquired data, updates its belief over uncertain models through Bayes' rule to approach the optimal classifier of the true model, which minimizes the classification error.
- Without identifying whether the uncertainty is related to the classification error or not, these methods can be inefficient in the sense that they may query candidates that do not directly help improve prediction performance.
- When there is model uncertainty captured by π(θ), an Optimal Bayesian Classifier (OBC) ψπ(θ) is the classifier that minimizes the expected loss over π(θ) (Dalton & Dougherty, 2013): ψπ(θ) = argminψ Eπ(θ)[Cθ(ψ, x)]
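
For a finite set of candidate models, the OBC definition above reduces to predicting with the posterior-averaged class probabilities. The sketch below is a minimal illustration under that discrete-model assumption; the array sizes and numbers are illustrative, not from the paper.

```python
import numpy as np

def obc_predict(pi, p_y_given_x_theta):
    """Optimal Bayesian Classifier prediction for one input x.

    pi: (n_theta,) posterior probabilities over candidate models theta.
    p_y_given_x_theta: (n_theta, n_classes) class likelihoods p(y|x, theta).
    Returns the label maximizing the posterior predictive p(y|x).
    """
    predictive = pi @ p_y_given_x_theta  # p(y|x) = sum_theta pi(theta) p(y|x, theta)
    return int(np.argmax(predictive))

# Two candidate models that disagree on x; the posterior favors the first.
pi = np.array([0.7, 0.3])
p = np.array([[0.8, 0.2],   # model 1: class 0 likely
              [0.1, 0.9]])  # model 2: class 1 likely
print(obc_predict(pi, p))   # predictive = [0.59, 0.41] -> class 0
```

Because the predictive averages over π(θ), the OBC can disagree with every individual model's own optimal classifier, which is exactly why reducing the *relevant* model uncertainty matters.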

Highlights

- In supervised learning, labeling data is often expensive and highly time-consuming
- We propose a novel weighted mean objective cost of uncertainty (MOCU) active learning method that focuses only on the uncertainty related to the loss, enabling efficient active learning, and is guaranteed to converge to the optimal classifier of the true model
- Before presenting the theoretical convergence guarantee of weighted-MOCU-based active learning, we summarize the computation of our weighted-MOCU-based acquisition function in Algorithm 1; it can replace Expected Loss Reduction (ELR) and MOCU-based acquisition functions in Bayesian active learning algorithms, with pseudo-code given in Appendix B
- We show that if active learning for a binary classification problem is guided by the acquisition function defined by (10) and (11), MOCU will converge to 0 almost surely and the procedure will converge to learning the optimal classifier of the true model
- In addition to the one-dimensional simulated example introduced in Section 1, we test our model in a simulation setting similar to the "block in the middle" dataset of Houlsby et al. (2011), where noisy observations with flip error are simulated in a block region on the decision boundary
- We have identified potential convergence problems of existing ELR methods and proposed a novel active learning strategy for classification based on weighted MOCU
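
The acquisition computation summarized in the highlights can be sketched for a discrete model set and a finite candidate pool as follows. This is the *plain* one-step-look-ahead MOCU reduction, not the authors' exact weighted variant (the paper weights MOCU precisely because the plain ELR/MOCU look-ahead can fail to converge); shapes and helper names are assumptions for illustration.

```python
import numpy as np

def mocu(pi, lik):
    """MOCU over a pool. lik[t, i, y] = p(y | x_i, theta_t); pi[t] = pi(theta_t).
    MOCU = E_pi[OBC error] - E_pi[model-specific optimal error], averaged over the pool."""
    pred = np.einsum('t,tiy->iy', pi, lik)       # posterior predictive p(y | x_i)
    obc_err = 1.0 - pred.max(axis=1)             # OBC error at each x_i
    opt_err = pi @ (1.0 - lik.max(axis=2))       # expected error of each theta's own best classifier
    return float(np.mean(obc_err - opt_err))

def acquisition(pi, lik, i):
    """One-step-look-ahead expected MOCU reduction from querying candidate x_i."""
    pred_i = np.einsum('t,ty->y', pi, lik[:, i, :])  # p(y | x_i)
    expected_after = 0.0
    for y, py in enumerate(pred_i):
        if py == 0.0:                                # label with zero probability: skip
            continue
        post = pi * lik[:, i, y] / py                # Bayes update pi(theta | x_i, y)
        expected_after += py * mocu(post, lik)
    return mocu(pi, lik) - expected_after

def select_query(pi, lik):
    """Query the candidate with the largest expected MOCU reduction."""
    return int(np.argmax([acquisition(pi, lik, i) for i in range(lik.shape[1])]))
```

In a toy pool where one candidate distinguishes the models and another is uninformative, the informative candidate gets the higher score, matching the intuition that only error-relevant uncertainty is worth querying.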

Results

- The authors benchmark the weighted-MOCU method against other active learning algorithms, including random sampling, MES (Sebastiani & Wynn, 2000), BALD (Houlsby et al., 2011), and ELR (Roy & McCallum, 2001), on both simulated and real-world classification datasets.
- The authors set c = 1 for the weighted-MOCU function.
- In addition to the one-dimensional simulated example introduced in Section 1, the authors test the model in a simulation setting similar to the "block in the middle" dataset of Houlsby et al. (2011), where noisy observations with flip error are simulated in a block region on the decision boundary.
- For the model parameter prior, w1 ∼ U(0.3, 0.8), w2 ∼ U(−0.25, 0.25), and b ∼ U(−0.25, 0.25) are independent uniform distributions.
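
The stated prior can be sampled directly; the sketch below draws parameter vectors θ = (w1, w2, b) and labels points via the induced linear decision boundary with flip noise near it. The flip probability and the width of the noisy block are illustrative assumptions, since those details are not given in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(n):
    """Draw n independent parameter vectors theta = (w1, w2, b) from the stated prior."""
    w1 = rng.uniform(0.3, 0.8, n)
    w2 = rng.uniform(-0.25, 0.25, n)
    b = rng.uniform(-0.25, 0.25, n)
    return np.stack([w1, w2, b], axis=1)

def p_y1(theta, x, flip=0.2, block=0.2):
    """p(y=1 | x, theta) for the linear boundary w1*x1 + w2*x2 + b = 0,
    with label-flip noise inside a band of half-width `block` around it.
    (flip and block values are assumptions for illustration.)"""
    margin = theta[0] * x[0] + theta[1] * x[1] + theta[2]
    p = 1.0 if margin > 0 else 0.0
    if abs(margin) < block:          # noisy block on the decision boundary
        p = p * (1 - flip) + (1 - p) * flip
    return p
```

A pool of candidate inputs scored with `p_y1` under many prior draws is all that is needed to instantiate the likelihood tensor used by the acquisition functions.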

Conclusion

- The authors' weighted MOCU directly targets decreasing the classification error and ignores uncertainty irrelevant to the classification performance.
- It can capture continuous changes in objective-relevant uncertainty.
- Future work includes theoretical analysis of MOCU-guided active learning for multi-class classification, as well as developing optimization methods for active learning in continuous spaces.


Tables

- Table 1: The probabilities of p(y|x, θ) and p(y|x, yo)
- Table 2: The prior and posterior of π(θ)

Study subjects and analysis

samples: 200

From the figure, MES simply chooses the candidates with predictive probability closest to 0.5, so it can sample many noisy observations from the block region. ELR performs well in the first several iterations but poorly after 200 samples. Our weighted MOCU performs the best.

samples: 403

The authors also evaluate on the UCI User Knowledge dataset (Kahraman et al., 2013). The dataset includes 403 samples assigned to 4 classes (High, Medium, Low, Very Low), with each sample having five features in [0, 1]^5.

samples: 224

The samples are grouped into two classes: 224 samples in High or Medium, and 179 in Low or Very Low.

samples: 150

We present the results with the uncertainty class by setting αi = βi = 10 in eight randomly chosen bins; for the other bins, αi = 5, βi = 2 if the true frequency of High or Medium in the i-th bin is lower than 0.5, and αi = 2, βi = 5 otherwise. We randomly draw 150 samples from each class as the candidate pool and run the five active learning algorithms. We repeat the whole procedure 150 times and report the average error rates.

samples: 150

For the UCI Letter Recognition dataset, αi = 5, βi = 2 if the frequency in the i-th bin is higher than 0.5 and αi = 2, βi = 5 if the frequency is lower than 0.5. We also randomly draw 150 samples from each class as the candidate pool and run the five active learning algorithms. We repeat the whole procedure 150 times and the average error rates are shown in Fig. S7.

[Figure captions: classification error rate comparison on the UCI User Knowledge dataset; classification error rate comparison on the UCI Letter Recognition dataset.]
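
The per-bin Beta(αi, βi) priors above are conjugate to the binary labels, so each observed label updates a bin's belief in closed form. A minimal sketch of that update, with the uncertain-bin setting αi = βi = 10 from the experiment:

```python
from fractions import Fraction

def beta_bernoulli_update(alpha, beta, y):
    """Conjugate update of a per-bin Beta(alpha, beta) prior on p(y=1)
    after observing a binary label y in {0, 1}."""
    return (alpha + y, beta + (1 - y))

def predictive(alpha, beta):
    """Posterior predictive p(y=1) for the bin: alpha / (alpha + beta)."""
    return Fraction(alpha, alpha + beta)

# Uncertain bin: alpha = beta = 10 gives predictive 1/2; one positive label
# shifts it only slightly, reflecting the prior's strength.
a, b = 10, 10
a, b = beta_bernoulli_update(a, b, 1)
print(predictive(a, b))  # 11/21
```

By contrast, an informative bin such as Beta(5, 2) starts at predictive 5/7, which is why the choice of αi, βi relative to the true bin frequencies matters for the comparison.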

Reference

- Nguyen Viet Cuong, Wee Sun Lee, Nan Ye, Kian Ming A Chai, and Hai Leong Chieu. Active learning for probabilistic hypotheses using the maximum gibbs error criterion. In Advances in Neural Information Processing Systems, pp. 1457–1465, 2013.
- Lori A Dalton and Edward R Dougherty. Optimal classifiers with minimum expected error within a bayesian framework—part i: Discrete and gaussian models. Pattern Recognition, 46(5):1301– 1314, 2013.
- Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183– 1192. JMLR. org, 2017.
- Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. CRC press, 2013.
- Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pp. 766–774, 2010.
- Trong Nghia Hoang, Bryan Kian Hsiang Low, Patrick Jaillet, and Mohan Kankanhalli. Nonmyopic ε-bayes-optimal active learning of gaussian processes. In International Conference on Machine Learning, 2014.
- Neil Houlsby, Ferenc Huszar, Zoubin Ghahramani, and Mate Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
- H Tolga Kahraman, Seref Sagiroglu, and Ilhami Colak. The development of intuitive knowledge classifier and the modeling of domain dependent data. Knowledge-Based Systems, 37:283–295, 2013.
- Ashish Kapoor, Eric Horvitz, and Sumit Basu. Selective supervision: Guiding supervised learning with decision-theoretic active learning. In IJCAI, volume 7, pp. 877–882, 2007.
- Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In Advances in Neural Information Processing Systems, pp. 7024–7035, 2019.
- Stephen Mussmann and Percy Liang. On the relationship between data efficiency and error for uncertainty sampling. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3674–3682, Stockholmsmassan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
- Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In International Conference on Machine Learning, 2001.
- Paola Sebastiani and Henry P Wynn. Maximum entropy sampling and optimal bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1): 145–157, 2000.
- Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- Toan Tran, Thanh-Toan Do, Ian Reid, and Gustavo Carneiro. Bayesian generative active deep learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6295–6304, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- Byung-Jun Yoon, Xiaoning Qian, and Edward R Dougherty. Quantifying the objective cost of uncertainty in complex dynamical systems. IEEE Transactions on Signal Processing, 61(9):2256– 2266, 2013.
- Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semisupervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, volume 3, 2003.
- Denote Nx(n) as the number of times candidate x has been queried up to the n-th iteration. By posterior consistency (Gelman et al., 2013), Σ_{θ ∈ Θx} πn(θ) → 1 almost surely as Nx(n) → ∞. Since pn(y|x) = Σ_{θ ∈ Θ} πn(θ) p(y|x, θ), it follows that pn(y|x) → p(y|x, θr) almost surely as n → ∞. Hence limn→∞ πn(θ|x, y) ...
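
The posterior-consistency argument above can be illustrated numerically: repeatedly querying a fixed x and applying Bayes' rule drives the posterior mass onto the true model θr. The two candidate likelihood values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two candidate models that disagree at the queried x; theta_r = model 0 is true.
lik = np.array([0.9, 0.3])          # p(y=1 | x, theta) for each candidate theta
theta_r = 0
pi = np.array([0.5, 0.5])           # prior pi_0(theta)

for n in range(500):                # query x repeatedly, observe y ~ p(y | x, theta_r)
    y = rng.random() < lik[theta_r]
    like = lik if y else 1 - lik    # p(y | x, theta) for the observed label
    pi = pi * like
    pi = pi / pi.sum()              # Bayes' rule: pi_{n+1}(theta | x, y)

print(pi[theta_r])                  # posterior concentrates near 1 on theta_r
```

Since the two models assign different likelihoods at x, the posterior mass on the true model converges to 1 almost surely, which is the Nx(n) → ∞ statement above specialized to a two-model set.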
