Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect

NeurIPS 2020


Abstract

As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the samples of interest co-exist with each other in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning…
Introduction
  • The authors have witnessed the fast development of computer vision techniques [1,2,3], stemming from large and balanced datasets such as ImageNet [4] and MS-COCO [5].
  • Tasks such as ranking-based recommendation assume that the majority of the items are irrelevant to a user [53].
  • In such tasks, most of the training samples are background, and the background class is a good head class, whose effect should be kept and exempted from the TDE calculation.
  • To this end, the authors propose a background-exempted inference that specifically uses the original inference for the background class (a sketch of this idea follows below).
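The bullets above describe the idea only in words. A minimal NumPy sketch of one plausible reading, in which the background score comes from the original (non-TDE) inference and the remaining probability mass is re-normalized over the TDE foreground scores, might look as follows. The function names and the re-normalization step are our assumptions, not the paper's exact formula:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def background_exempted_inference(orig_logits, tde_logits, bg_index=0):
    """Keep the original prediction for the background class and use
    TDE scores only for foreground classes, re-normalizing the
    remaining probability mass over the foreground.

    `orig_logits` / `tde_logits` are assumed to be per-class scores
    from the same classifier before and after TDE subtraction."""
    p_orig = softmax(np.asarray(orig_logits, dtype=float))
    p_tde = softmax(np.asarray(tde_logits, dtype=float))
    out = np.empty_like(p_orig)
    out[bg_index] = p_orig[bg_index]             # background kept as-is
    fg = np.arange(len(out)) != bg_index
    out[fg] = (1.0 - p_orig[bg_index]) * p_tde[fg] / p_tde[fg].sum()
    return out                                    # sums to 1 by construction
```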
Highlights
  • Over the years, we have witnessed the fast development of computer vision techniques [1,2,3], stemming from large and balanced datasets such as ImageNet [4] and MS-COCO [5].
  • We find that the momentum M in any SGD optimizer [18, 19], which is indispensable for stabilizing gradients, is a confounder that is the common cause of the sample feature X and the classification logits Y.
  • We first propose a causal framework to pinpoint the causal effect of momentum in long-tailed classification, which theoretically explains previous methods and provides an elegant one-stage training solution that extracts the unbiased direct effect of each instance.
  • The detailed implementation consists of de-confounded training and total direct effect (TDE) inference, which is simple, adaptive, and agnostic to the prior statistics of the class distribution (a worked definition follows after this list).
  • The positive impacts of this work are two-fold: 1) it improves the fairness of the classifier and prevents potential discrimination by deep models, e.g., an unfair AI could blindly cater to the majority, causing gender, racial, or religious discrimination; 2) it allows larger-vocabulary datasets to be collected without compulsory class-balancing pre-processing, e.g., to train autonomous vehicles with the proposed method, we no longer need to collect as many ambulance images as normal van images.
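As a reference point, in standard mediation-analysis notation (cf. [22, 46]; the symbols below are our paraphrase, not copied from the paper), the total direct effect contrasts the factual input with a counterfactual baseline while the mediator keeps the value it attains under the factual input:

```latex
\mathrm{TDE}(Y) \;=\; Y_{x,\,d} \;-\; Y_{x_0,\,d},
\qquad d = D_x ,
```

where x is the factual feature, x_0 a null (counterfactual) input, and D_x the mediator value reached under x; subtracting the counterfactual term removes the momentum-induced "bad" bias while the kept mediator preserves the "good" effect.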
Methods
  • The authors tested K = 2 with the cosine classifier [50, 51] and the capsule classifier [9, 54] in Table 7.
  • This proves that the advantage of the proposed de-confounded model does not come from a larger K, and that multi-head fine-grained sampling generally improves de-confounded training, no matter which normalization function is chosen (see the classifier sketch below).
  • The hyper-parameters for LVIS are the same as in the original paper.
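For concreteness, a minimal PyTorch sketch of a K-head cosine classifier of the kind compared in Table 7. The channel splitting, the scale tau, and the smoothing term gamma are assumptions of this sketch, not the authors' exact formulation:

```python
import torch

def multi_head_cosine_logits(x, W, K=2, tau=16.0, gamma=1e-9):
    """Illustrative K-head normalized (cosine) classifier.

    Features x (B, D) and class weights W (C, D) are each split into
    K equal channel groups; every group is length-normalized, the
    per-head cosine logits are summed, and the result is scaled by
    tau / K."""
    B, D = x.shape
    C, _ = W.shape
    assert D % K == 0, "feature dimension must split evenly into K heads"
    xs = x.view(B, K, D // K)
    ws = W.view(C, K, D // K)
    xs = xs / (xs.norm(dim=-1, keepdim=True) + gamma)
    ws = ws / (ws.norm(dim=-1, keepdim=True) + gamma)
    # sum of per-head cosine similarities over heads and channels -> (B, C)
    return tau / K * torch.einsum('bkd,ckd->bc', xs, ws)

# Example: 4 samples, 512-d features, 1000 classes, K = 2 heads.
logits = multi_head_cosine_logits(torch.randn(4, 512), torch.randn(1000, 512))
```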
Results
  • For long-tailed CIFAR-10/-100 [12, 10], the authors evaluated Top-1 accuracy under three imbalance ratios: 100, 50, and 10.
  • For LVIS [7], the evaluation metric is the standard segmentation mask AP, calculated across IoU thresholds from 0.5 to 0.95 for all classes.
  • Classes are categorized by frequency and independently reported as APr, APc, and APf, where the subscripts r, c, and f stand for rare, common, and frequent (a grouping sketch follows below).
  • The main difference between image classification and object detection/instance segmentation is that in the latter, most training samples belong to the background class.
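A small sketch of how the per-class AP numbers are grouped into APr / APc / APf. The bin thresholds follow the LVIS definition [7] (rare: 1-10 training images, common: 11-100, frequent: >100); the helper name is ours:

```python
import numpy as np

def grouped_ap(per_class_ap, train_images_per_class):
    """Average per-class AP inside the LVIS frequency bins [7]."""
    ap = np.asarray(per_class_ap, dtype=float)
    n = np.asarray(train_images_per_class)
    return {
        "APr": ap[n <= 10].mean(),                # rare: 1-10 images
        "APc": ap[(n > 10) & (n <= 100)].mean(),  # common: 11-100 images
        "APf": ap[n > 100].mean(),                # frequent: >100 images
    }

# Example: three classes seen in 5, 50, and 500 training images.
print(grouped_ap([0.12, 0.28, 0.41], [5, 50, 500]))
```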
Conclusion
  • The authors first proposed a causal framework to pinpoint the causal effect of momentum in long-tailed classification, which theoretically explains previous methods and provides an elegant one-stage training solution that extracts the unbiased direct effect of each instance.
  • Negative impacts could occur if the proposed long-tailed classification technique falls into the wrong hands, e.g., it could be used to identify minority groups for malicious purposes.
  • It is our duty to make sure that the long-tailed classification technique is used for the right purposes.
Tables
  • Table 1: Revisiting the previous state-of-the-art methods in our causal graph. CDE: Controlled Direct Effect. NDE: Natural Direct Effect. TDE: Total Direct Effect.
  • Table 2: Performance on the ImageNet-LT test set [9]. All models use the ResNeXt-50 backbone. The superscript † denotes re-implementation with our framework and hyper-parameters.
  • Table 3: Top-1 accuracy on long-tailed CIFAR-10 and CIFAR-100 with different imbalance ratios. All models use the same ResNet-32 backbone. Note that we report accuracy rather than error rate (unlike BBN [10]) for consistency.
  • Table 4: All models use the same Cascade Mask R-CNN framework [24] with an R101-FPN backbone [59]. The reported results are evaluated on the LVIS val set [7].
  • Table 5: Results of the proposed TDE with/without Background-Exempted Inference on the LVIS [7] v0.5 val set. The Cascade Mask R-CNN framework [24] with an R101-FPN backbone [59] is used.
  • Table 6: Hyper-parameter selection based on performance on the ImageNet-LT val set, where a "–" for α means that TDE inference is not included. The backbone used here is ResNeXt-50-32x4d.
  • Table 7: Performance of the cosine classifier [50, 51] and the capsule classifier [9, 54] under different numbers of heads K on the ImageNet-LT test set. Other hyper-parameters are fixed.
  • Table 8: Performance of the proposed method under different backbones on the ImageNet-LT test set.
  • Table 9: Performance of the proposed method under different backbones on the LVIS v0.5 val set.
  • Table 10: Single-model performance of the proposed method on the LVIS v0.5 evaluation test server [62].
Related work
  • Re-Balanced Training. The most widely used solution for long-tailed classification is arguably to re-balance the contribution of each class in the training phase. This can be achieved either by re-sampling [25, 26, 15, 16, 27] or by re-weighting [13, 14, 12, 17] (a minimal re-weighting sketch follows below). However, these methods inevitably cause under-fitting/over-fitting for head/tail classes. Besides, relying on access to the data distribution also limits their application scope, e.g., they are not applicable to online and streaming data.

    Hard Example Mining. Instance-level re-weighting [28,29,30] is also a practical solution. Instead of hacking the prior class distribution, focusing on hard samples also alleviates the long-tailed issue, e.g., using meta-learning to find the conditional weight for each sample [31], or enhancing the samples of hard categories with a group softmax [32].
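As a reference for the re-weighting family, a minimal NumPy sketch of the effective-number class-balanced weighting of Cui et al. [13]; the normalization convention (weights summing to the number of classes) is one common choice:

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Per-class loss weights from the 'effective number' of samples
    (Cui et al. [13]): E_n = (1 - beta^n) / (1 - beta).
    Weights are the inverse effective number, normalized so that
    they sum to the number of classes."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(n)

# Example: a 3-class long-tailed split; the tail class gets the largest weight.
print(class_balanced_weights([5000, 500, 50]))
```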
Funding
  • We achieve 3.5% and 3.1% absolute improvements on mask AP and box AP using the same Cascade Mask R-CNN with an R101-FPN backbone [24].
Study subjects and analysis
50 samples per class:
Given a learnable parameter θ ∈ R², suppose the per-instance gradients for classes A and B approximate (1, 1) and (−1, 1), respectively. If each of these two classes has 50 samples, the mean gradient is (0, 1), which is the optimal gradient direction shared by both A and B. The momentum will thus accelerate along this direction, optimizing the model to fairly discriminate the two classes.

99 samples vs. 1 sample:
However, if there are 99 samples from class A and only 1 sample from class B (a long-tailed dataset), the mean gradient becomes (0.98, 1). In this case, the momentum direction approximates the class-A (head) gradient, encouraging the backbone parameters to generate head-like feature vectors, i.e., creating an unfair deviation towards the head (see the numeric sketch below).
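The two scenarios above can be reproduced numerically. A minimal sketch, in which the momentum coefficient and step count are arbitrary choices for illustration:

```python
import numpy as np

def sgd_momentum_direction(grads_a, grads_b, n_a, n_b, mu=0.9, steps=100):
    """Average batch gradient under an n_a:n_b class mix, accumulated by
    heavy-ball momentum v <- mu * v + g (the PyTorch SGD convention [45]).
    Returns the normalized momentum direction after `steps` updates."""
    g = (n_a * np.array(grads_a) + n_b * np.array(grads_b)) / (n_a + n_b)
    v = np.zeros_like(g)
    for _ in range(steps):
        v = mu * v + g
    return v / np.linalg.norm(v)

# Balanced 50:50 -> mean gradient (0, 1): momentum is fair to both classes.
print(sgd_momentum_direction((1, 1), (-1, 1), 50, 50))   # ~[0.   1.  ]
# Long-tailed 99:1 -> mean gradient (0.98, 1): momentum drifts toward
# the head-class direction, the 'bad' deviation the paper removes.
print(sgd_momentum_direction((1, 1), (-1, 1), 99, 1))    # ~[0.70 0.71]
```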

References
  • [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
  • [6] William J. Reed. The Pareto, Zipf and other power laws. Economics Letters, 74(1):15–19, 2001.
  • [7] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019.
  • [8] Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. In ECCV, 2020.
  • [9] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In CVPR, 2019.
  • [10] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR, 2020.
  • [11] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In ICLR, 2020.
  • [12] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.
  • [13] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, pages 9268–9277, 2019.
  • [14] Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 2017.
  • [15] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, pages 467–482, 2016.
  • [16] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
  • [17] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition. In CVPR, 2020.
  • [18] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, pages 1139–1147, 2013.
  • [19] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
  • [20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [21] Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
  • [22] Judea Pearl. Direct and indirect effects. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2001.
  • [23] Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.
  • [24] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
  • [25] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • [26] Chris Drummond, Robert C. Holte, et al. Class imbalance and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11, pages 1–8.
  • [27] Xinting Hu, Yi Jiang, Kaihua Tang, Jingyuan Chen, Chunyan Miao, and Hanwang Zhang. Learning to segment the tail. In CVPR, 2020.
  • [28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • [29] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-Weight-Net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, 2019.
  • [30] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
  • [31] Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In CVPR, 2020.
  • [32] Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, and Jiashi Feng. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In CVPR, 2020.
  • [33] Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, and Junjie Yan. Dynamic curriculum learning for imbalanced data classification. In ICCV, 2019.
  • [34] Jialun Liu, Yifan Sun, Chuchu Han, Zhaopeng Dou, and Wenhui Li. Deep representation learning on long-tailed data: A learnable embedding augmentation perspective. In CVPR, 2020.
  • [35] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
  • [36] David P. MacKinnon, Amanda J. Fairchild, and Matthew S. Fritz. Mediation analysis. Annual Review of Psychology, 2007.
  • [37] Luke Keele. The statistics of causal inference: A view from political methodology. Political Analysis, 2015.
  • [38] Lorenzo Richiardi, Rino Bellocco, and Daniela Zugna. Mediation analysis in epidemiology: Methods, interpretation and bias. International Journal of Epidemiology, 2013.
  • [39] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In CVPR, 2020.
  • [40] Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. Two causal principles for improving visual dialog. In CVPR, 2020.
  • [41] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. arXiv preprint arXiv:2006.04315, 2020.
  • [42] Xu Yang, Hanwang Zhang, and Jianfei Cai. Deconfounded image captioning: A causal retrospect. arXiv preprint arXiv:2003.03923, 2020.
  • [43] Dong Zhang, Hanwang Zhang, Jinhui Tang, Xiansheng Hua, and Qianru Sun. Causal intervention for weakly-supervised semantic segmentation. In NeurIPS, 2020.
  • [44] Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Interventional few-shot learning. In NeurIPS, 2020.
  • [45] SGD implementation in PyTorch. https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html
  • [46] Tyler J. VanderWeele. A three-way decomposition of a total effect into direct, indirect, and interactive effects. Epidemiology, 2013.
  • [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [48] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
  • [49] Peter C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 2011.
  • [50] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
  • [51] Hang Qi, Matthew Brown, and David G. Lowe. Low-shot learning with imprinted weights. In CVPR, 2018.
  • [52] Judea Pearl. On the consistency rule in causal inference: Axiom, definition, assumption, or theorem? Epidemiology, 21(6):872–875, 2010.
  • [53] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
  • [54] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.
  • [55] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [56] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
  • [57] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [58] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [59] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [60] Douglas C. Montgomery and George C. Runger. Applied Statistics and Probability for Engineers. John Wiley & Sons, 2010.
  • [61] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
  • [62] LVIS v0.5 Evaluation Server. https://evalai.cloudcv.org/web/challenges/challenge-page/473/overview
Author
Kaihua Tang
Jianqiang Huang