Auxiliary Task Reweighting for Minimum-data Learning

NeurIPS 2020.

Links: arxiv.org | dblp.uni-trier.de | academic.microsoft.com
TL;DR: We develop ARML (Auxiliary task Reweighting for Minimum-data Learning), an algorithm that automatically reweights auxiliary tasks so that the data requirement for the main task is minimized.

Abstract

Supervised learning requires a large amount of training data, limiting its application where labeled data is scarce. To compensate for data scarcity, one possible method is to utilize auxiliary tasks to provide additional supervision for the main task. Assigning and optimizing the importance weights for different auxiliary tasks remains...

Introduction
  • Supervised deep learning methods typically require an enormous amount of labeled data, which, for many applications, is difficult, time-consuming, expensive, or even impossible to collect.
  • Training with auxiliary tasks has been shown to achieve better generalization [2], and is widely used in many applications, e.g., semi-supervised learning [64], self-supervised learning [49], transfer learning [57], and reinforcement learning [30].
  • Both the main task and the auxiliary tasks are jointly trained, but only the main task's performance matters for the downstream goals.
  • There are several works along this direction [5, 11, 14, 37], but they either only filter out unrelated tasks without further differentiating among related ones, or are driven by a specific motivation that limits their general use.
Highlights
  • Supervised deep learning methods typically require an enormous amount of labeled data, which, for many applications, is difficult, time-consuming, expensive, or even impossible to collect.
  • We propose a method to adaptively reweight auxiliary tasks on the fly during joint training so that the data requirement on the main task is minimized.
  • Our goal is to find the optimal parameter θ∗ for the main task, using data from the main task as well as from the auxiliary tasks.
  • We present the final algorithm, Auxiliary task Reweighting for Minimum-data Learning (ARML).
  • We show that ARML can minimize the data requirement under two realistic settings: semi-supervised learning and multi-label classification. We consider the following task-reweighting methods for comparison: (i) Uniform: all weights are set to 1; (ii) AdaLoss [25]: tasks are reweighted based on uncertainty; (iii) GradNorm [11]: each task's gradient norm is balanced; (iv) CosineSim [14]: a task is filtered out when its cosine similarity cos(∇ log p(T_ak | θ), ∇ log p(T_m | θ)) with the main task is negative; (v) OL_AUX [37]: tasks receive higher weights when the gradient inner product ∇ log p(T_ak | θ)ᵀ ∇ log p(T_m | θ) is large. A minimal sketch of the gradient-based rules (iv) and (v) is given after this list.
  • We develop ARML, an algorithm to automatically reweight auxiliary tasks so that the data requirement for the main task is minimized.
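To make the gradient-based comparison rules concrete, here is a minimal PyTorch-style sketch of (iv) CosineSim-style filtering and (v) an OL_AUX-style inner-product update. The function names, the gradient-flattening helper, and the step size `beta` are illustrative assumptions, not the implementations used in the paper.

```python
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])


def cosine_sim_weights(main_loss, aux_losses, params):
    """CosineSim-style filtering [14]: keep an auxiliary task only if its
    gradient is positively aligned with the main-task gradient."""
    g_main = flat_grad(main_loss, params)
    weights = []
    for aux_loss in aux_losses:
        g_aux = flat_grad(aux_loss, params)
        cos = F.cosine_similarity(g_main, g_aux, dim=0)
        weights.append(1.0 if cos.item() > 0 else 0.0)  # drop negatively aligned tasks
    return weights


def ol_aux_step(alpha, main_loss, aux_losses, params, beta=1e-3):
    """OL_AUX-style update [37]: raise a task's weight in proportion to the
    inner product between its gradient and the main-task gradient."""
    g_main = flat_grad(main_loss, params)
    for k, aux_loss in enumerate(aux_losses):
        g_aux = flat_grad(aux_loss, params)
        alpha[k] += beta * torch.dot(g_main, g_aux).item()
    return alpha
```

In joint training, the resulting weights would then scale the auxiliary losses before backpropagation.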
Methods
  • As shown in Fig. 3, ARML improves accuracy over different domain generalization and FSDA methods.
  • Even when Nm = 1, where FSDA methods underperform, ARML still brings an improvement of ∼4% in accuracy.
  • This means ARML can benefit unsupervised domain generalization with as few as one labeled image per class.
Conclusion
  • The authors develop ARML, an algorithm to automatically reweight auxiliary tasks so that the data requirement for the main task is minimized.
  • The authors first formulate the weighted likelihood function of the auxiliary tasks as a surrogate prior for the main task.
  • The optimal weights are obtained by minimizing the divergence between the surrogate prior and the true prior.
  • The authors design a practical algorithm by turning the optimization problem into minimizing the distance between the main-task gradient and the weighted auxiliary-task gradients (a minimal sketch of this update is given after this list).
  • The authors demonstrate its effectiveness and robustness in reducing the data requirement under various settings, including the extreme case of only a few examples.
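As a rough illustration of the practical algorithm described above, the sketch below updates the weights α by one gradient step that shrinks the distance between the main-task gradient and the α-weighted sum of auxiliary-task gradients. The projection step (clamping to non-negative values and rescaling to a fixed sum), the learning rate, and the function name are assumptions for illustration; they are not claimed to match Alg. 1 of the paper.

```python
import torch


def arml_style_weight_update(alpha, g_main, g_aux_list, lr_alpha=1e-2):
    """One illustrative gradient-matching update of the task weights alpha.

    alpha       : 1-D tensor of K non-negative task weights
    g_main      : flattened main-task gradient, shape (D,)
    g_aux_list  : list of K flattened auxiliary-task gradients, each shape (D,)
    """
    alpha = alpha.clone().detach().requires_grad_(True)

    # Distance between the main-task gradient and the alpha-weighted
    # combination of auxiliary-task gradients.
    g_aux = torch.stack(g_aux_list)            # (K, D)
    combined = torch.mv(g_aux.t(), alpha)      # (D,) = sum_k alpha_k * g_aux_k
    dist = torch.sum((g_main - combined) ** 2)

    # One gradient step on alpha to shrink that distance, followed by a
    # simple projection (non-negativity, then rescaling to a fixed sum).
    (grad_alpha,) = torch.autograd.grad(dist, alpha)
    with torch.no_grad():
        alpha = alpha - lr_alpha * grad_alpha
        alpha = torch.clamp(alpha, min=0.0)
        alpha = alpha * (len(g_aux_list) / alpha.sum().clamp(min=1e-8))
    return alpha.detach()
```

In joint training, the updated α would then weight the auxiliary losses, e.g. loss = loss_main + Σ_k α_k · loss_aux_k.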
Summary
  • Objectives:

    The authors' goal is to find the optimal parameter θ∗ for the main task, using data from the main task as well as from the auxiliary tasks.
  • The full algorithm is shown in Alg. 1.
  • The authors' objective is (A11).
  • The authors' goal is to find the optimal α∗ that minimizes KL_α, i.e., α∗ = arg min_α KL_α = arg max_α e^(−KL_α); this objective is restated more explicitly after this list.
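For completeness, the objective above can be written out in LaTeX as follows. The surrogate prior is written here as a normalized, α-weighted product of auxiliary-task likelihoods, as described in the Conclusion; the exact notation (including the direction of the KL divergence and the normalizer Z(α)) is an assumption for illustration and is not copied from equation (A11).

```latex
% Surrogate prior: alpha-weighted product of auxiliary-task likelihoods
% (normalizer Z(alpha) is assumed for illustration).
\hat{p}_{\alpha}(\theta) = \frac{1}{Z(\alpha)} \prod_{k=1}^{K} p(\mathcal{T}_{a_k} \mid \theta)^{\alpha_k}

% Optimal weights: minimize the divergence between the surrogate prior
% and the true prior p(theta), equivalently maximize e^{-KL_alpha}.
\mathrm{KL}_{\alpha} = \mathrm{KL}\!\left( \hat{p}_{\alpha}(\theta) \,\middle\|\, p(\theta) \right),
\qquad
\alpha^{*} = \arg\min_{\alpha} \mathrm{KL}_{\alpha} = \arg\max_{\alpha} e^{-\mathrm{KL}_{\alpha}}
```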
Tables
  • Table 1: Test error of semi-supervised learning on CIFAR-10 and SVHN. From top to bottom
  • Table 2: Test error of the main task on CelebA
  • Table 3: Top 5 relevant / irrelevant attributes (auxiliary tasks) to the target attribute (main task) on CelebA
  • Table 4: Results of multi-source domain generalization (with 5 extra labeled images per class in the target domain). We list results with each of the four domains as the target domain. From top to bottom: domain generalization methods, FSDA methods, and the different methods equipped with ARML. JT is short for joint training. † means the results we reproduced are higher than originally reported
  • Table 5: Standard deviation of different types of noise. We find that the gradient noise is negligible compared to the injected noise
Related Work
  • Additional Supervision from Auxiliary Tasks: When there is not enough data to learn a task, it is common to introduce additional supervision from related auxiliary tasks. For example, in semi-supervised learning, previous work has employed various kinds of manually-designed supervision on unlabeled data [48, 56, 64]. In reinforcement learning, due to sample inefficiency, auxiliary tasks (e.g. vision prediction [43], reward prediction [55]) are jointly trained to speed up convergence. In transfer learning or domain adaptation, models are trained on related domains/tasks and generalize to unseen domains [4, 8, 57]. Learning using privileged information (LUPI) also employs additional knowledge (e.g. metadata, an additional modality) at training time [24, 54, 59]. However, LUPI does not emphasize the scarcity of training data as in our problem setting.
Funding
  • Darrell’s group was supported in part by DoD, BAIR and BDD
  • Saenko was supported by DARPA and NSF
  • Hoffman was supported by DARPA
References
  • Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for metalearning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6430–6439, 2019.
  • Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
  • Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Nader Asadi, Mehrdad Hosseinzadeh, and Mahdi Eftekhari. Towards shape biased unsupervised representation learning for domain generalization. arXiv preprint arXiv:1909.08245, 2019.
  • Bart Bakker and Tom Heskes. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
  • Jonathan Baxter. A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine learning, 28(1):7–39, 1997.
  • Hakan Bilen and Andrea Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in neural information processing systems, pages 235–243, 2016.
  • Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2229–2238, 2019.
  • Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
  • Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257, 2017.
  • Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008.
  • Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Yunshu Du, Wojciech M Czarnecki, Siddhant M Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224, 2018.
  • Antonio D’Innocente and Barbara Caputo. Domain generalization with domain-specific aggregation modules. In German Conference on Pattern Recognition, pages 187–198.
  • Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM Transactions on graphics (TOG), 31(4):1–10, 2012.
  • Theodoros Evgeniou and Massimiliano Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117, 2004.
  • Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Transductive multiview zero-shot learning. IEEE transactions on pattern analysis and machine intelligence, 37(11):2332–2345, 2015.
  • Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
  • Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • TM Heskes. Empirical bayes for learning to learn. In Proceedings of the 17th international conference on Machine learning, pages 364–367, 2000.
  • Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 826–834, 2016.
  • Hanzhang Hu, Debadeepta Dey, Martial Hebert, and J Andrew Bagnell. Learning anytime predictions in neural networks via adaptive loss balancing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3812–3821, 2019.
  • Tianyang Hu, Zixiang Chen, Hanxi Sun, Jincheng Bai, Mao Ye, and Guang Cheng. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.
  • Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7304–7308. IEEE, 2013.
  • Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 2, 2013.
  • Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550, 2017.
  • Xingyu Lin, Harjatin Baweja, George Kantor, and David Held. Adaptive auxiliary task weighting for reinforcement learning. In Advances in Neural Information Processing Systems, pages 4773–4784, 2019.
  • Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pages 276–284, 2016.
  • Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378–2386, 2016.
  • Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • Mingsheng Long, Zhangjie Cao, Jianmin Wang, and S Yu Philip. Learning multiple tasks with multilinear relationship networks. In Advances in neural information processing systems, pages 1594–1603, 2017.
  • Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3, 2013.
  • Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
  • Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017.
  • Iain Murray and Zoubin Ghahramani. Bayesian learning in undirected graphical models: approximate mcmc algorithms. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 392–399. AUAI Press, 2004.
  • Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
  • Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.
  • Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
  • Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
  • Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
  • Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pages 527–538, 2018.
  • Viktoriia Sharmanska, Novi Quadrianto, and Christoph H Lampert. Learning to rank using privileged information. In Proceedings of the IEEE International Conference on Computer Vision, pages 825–832, 2013.
  • Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
  • Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
  • Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
  • Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009.
  • Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
  • Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with dirichlet process priors. Journal of Machine Learning Research, 8(Jan):35– 63, 2007.
  • Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22nd international conference on Machine learning, pages 1012–1019, 2005.
  • Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
  • Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE international conference on computer vision, pages 1476–1485, 2019.
  • Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 733–742, 2010.