TL;DR: We improve how we use auxiliary task data for model pre-training by decomposing gradient updates into components guided by the primary task.

Auxiliary Task Update Decomposition: The Good, the Bad and the Neutral

ICLR 2021


Abstract

While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. […]

Introduction
Highlights
  • Multitask learning (Caruana, 1997) and pretraining (Devlin et al., 2018; Caron et al., 2019) have transformed machine learning by allowing downstream tasks with small training sets to benefit from the statistical regularities of data-rich related tasks (Collobert & Weston, 2008; Zhang et al., 2014; Liu et al., 2019; Kornblith et al., 2019)
  • We introduce a framework which decomposes the gradient updates from the auxiliary tasks according to their impact on the primary task
  • To achieve a tractable approach, we introduce an efficient, robust algorithm (ATTITTUD, Auxiliary Task Training with Influence from Target Task Update Direction) to estimate the subspace spanned by the primary task gradients in an online manner and decompose the auxiliary updates appropriately
  • When different data is used for the auxiliary task and the primary task (Imdb + Amazon MLM, Amazon + Imdb MLM columns), Task-Adaptive Pre-training (TAPT) does not perform as well as ATTITTUD
  • This highlights the advantage of ATTITTUD when the auxiliary task data distribution differs from the primary task distribution
  • Our method decomposes the gradients of the auxiliary task according to three directions, with positive, negative and neutral impact on the primary task (a minimal code sketch of this decomposition follows this list)
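To make the three-way split concrete, the following is a minimal NumPy sketch of this kind of gradient decomposition. It is an illustration rather than the paper's exact ATTITTUD procedure: the function names and the sign-based split of the in-span component are assumptions, and an exact SVD stands in for the randomized SVD the method uses for efficiency.

```python
import numpy as np

def decompose_aux_grad(J_prim, g_aux, k=5):
    """Illustrative split of an auxiliary-task gradient into components with
    positive ("good"), negative ("bad") and neutral impact on the primary task.

    J_prim : (m, D) matrix of per-example primary-task gradients (J* in the text).
    g_aux  : (D,) auxiliary-task gradient.
    k      : dimension of the primary-gradient subspace.
    """
    # Orthonormal basis of the top-k subspace spanned by the primary gradients
    # (exact SVD for clarity; the paper uses a randomized SVD for efficiency).
    _, _, Vt = np.linalg.svd(J_prim, full_matrices=False)
    V = Vt[:k].T                           # (D, k)

    # Component of g_aux inside the primary subspace vs. orthogonal to it.
    coords = V.T @ g_aux                   # coordinates in the subspace
    in_span = V @ coords
    neutral = g_aux - in_span              # out-of-span ("neutral") part

    # Split the in-span part by agreement with the average primary gradient,
    # coordinate-wise in the subspace (one plausible reading of the split).
    prim_coords = V.T @ J_prim.mean(axis=0)
    agree = np.sign(coords) == np.sign(prim_coords)
    good = V @ np.where(agree, coords, 0.0)
    bad = V @ np.where(agree, 0.0, coords)
    return good, bad, neutral

def reweighted_aux_update(J_prim, g_aux, k=5, w_pos=1.0, w_neg=1.0, w_neutral=1.0):
    """Recombine the three components with arbitrary weights; with all weights
    equal to 1 the original auxiliary gradient is recovered."""
    good, bad, neutral = decompose_aux_grad(J_prim, g_aux, k)
    return w_pos * good + w_neg * bad + w_neutral * neutral
```

Because the three weights are free, particular settings resemble existing strategies (for instance, zeroing only the conflicting component is PCGrad-like), while other settings in the family are novel.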
Methods
  • The compared methods are: No Pretraining, Vanilla Pre-training, PCGrad, Multitask, and Ours; for the Medical Imaging Transfer setting: No Pretraining, Pretrained-ResNet, and Pretrained-ResNet + Ours.
  • The authors' method outperforms a ResNet model pre-trained on ImageNet.
  • The authors apply end-task-aware ATTITTUD to 100k ImageNet images after the initial pre-training and reach 83.3% AUC, an improvement over 81.4%.
  • For the NLP text classification experiments, the authors tried limiting the number of layers to which ATTITTUD is applied.
  • For all experiments involving ATTITTUD, the authors cross-validate the subspace size k ∈ {5, 10, 20} for the basis estimated from J* ∈ R^{m×D}, using m ∈ {32, 64} (see the subspace-estimation sketch after this list).
  • The authors performed early stopping in all experiments when no improvement was observed for 10 consecutive epochs.
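As a concrete illustration of the subspace estimation being cross-validated here, the sketch below builds a k-dimensional basis from an m × D matrix of per-example primary-task gradients with a randomized SVD (Halko et al., 2011). The helper name and the use of scikit-learn's randomized_svd are assumptions made for illustration; the paper's own online estimation procedure may differ in detail.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def primary_subspace(per_example_grads, k, seed=0):
    """Top-k right-singular subspace of J* in R^{m x D}, estimated with a
    randomized SVD (Halko et al., 2011)."""
    _, _, Vt = randomized_svd(per_example_grads, n_components=k, random_state=seed)
    return Vt.T  # (D, k) orthonormal basis used by the decomposition above

# Hypothetical grid matching the summary: k in {5, 10, 20}, m in {32, 64}.
for m in (32, 64):
    J_star = np.random.randn(m, 1000)  # stand-in for per-example gradients
    for k in (5, 10, 20):
        V = primary_subspace(J_star, k)
        assert V.shape == (1000, k)
```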
Results
  • In the low-resource Cat-vs-Dog setting, ATTITTUD produces a larger boost in performance compared to baselines, with the best performing configuration being ηaux = (1.0, 0.0, 0.0), ηprim = 0.01 (an illustrative update step using this configuration follows this list).
  • The authors posit that this configuration is successful because removing the in-span components makes overfitting less likely.
  • Note that the best performing configurations are all novel and never an instantiation of PCGrad.
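For illustration only, the sketch below spells out one reading of this best configuration as a single update step: following the point above about removing the in-span components, ηaux = (1.0, 0.0, 0.0) is interpreted as keeping only the out-of-span (neutral) part of the auxiliary gradient, with the primary-task gradient added at weight ηprim = 0.01. This mapping of the tuple onto components is an assumption, not a convention stated in the summary.

```python
import numpy as np

def combined_step(params, J_prim, g_aux, lr=0.1, eta_prim=0.01, k=5):
    """Hypothetical single parameter update for the reported best low-resource
    Cat-vs-Dog configuration: drop both in-span auxiliary components, keep the
    out-of-span (neutral) one, and add the primary gradient with a small weight."""
    _, _, Vt = np.linalg.svd(J_prim, full_matrices=False)
    V = Vt[:k].T                          # (D, k) primary-gradient basis
    neutral = g_aux - V @ (V.T @ g_aux)   # out-of-span part of the auxiliary gradient
    g_prim = J_prim.mean(axis=0)          # average primary-task gradient
    return params - lr * (eta_prim * g_prim + neutral)
```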
Conclusion
  • The authors propose a new approach to training a model with additional help from an auxiliary task.
  • The authors' method decomposes the gradients of the auxiliary task according to three directions, with positive, negative and neutral impact on the primary task.
  • This decomposition allows a flexible re-weighting of the auxiliary task components and gives rise to a family of training strategies, which encompasses novel and existing approaches.
  • Experiments in multitasking, pretraining and domain transfer over vision and text classification tasks demonstrate the empirical benefit of the framework
Tables
  • Table1: Results on Text Classification measured by F1. Experiments are averaged over 5 runs
  • Table2: Average Accuracy on MultiCifar100 and Cat-vs-Dog Cifar10 tasks. Cat-vs-Dog experiments are averaged over 5 runs
  • Table3: Results on the ChexPert-5k task measured by average AUC (Area Under the ROC Curve). All experiments are averaged over 5 runs
  • Table4: Experiment conducted on the Cat-vs-Dog Cifar10 dataset for different choices of subspace basis. We use k = 5 for Random and Randomized SVD. The compared bases are random (the basis spanned by k randomly chosen orthogonal vectors in R^D), unit avg grad (the basis spanned by the average primary task gradient), and canonical (the per-parameter basis). This ablation was performed under a more limited tuning budget (we cross-validated configurations (1, 1, 0) and (1, 1, −1) only) than the full Cat-vs-Dog experiments from Table 2. A sketch of these basis constructions follows this list
  • Table5: Results on ChexPert-5k tasks measured by average AUC (Area Under the ROC Curve)
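For reference, here is a small, assumed construction of the three alternative bases ablated in Table 4; the helper name is hypothetical, and the canonical basis is materialized this way only for tiny D.

```python
import numpy as np

def ablation_bases(J_prim, k, seed=0):
    """Illustrative constructions of the Table 4 basis choices.
    J_prim is the (m, D) matrix of per-example primary-task gradients."""
    D = J_prim.shape[1]
    rng = np.random.default_rng(seed)
    # 'random': k randomly chosen orthonormal directions in R^D.
    random_basis, _ = np.linalg.qr(rng.standard_normal((D, k)))
    # 'unit avg grad': the normalized average primary-task gradient.
    avg = J_prim.mean(axis=0)
    unit_avg_grad = (avg / np.linalg.norm(avg)).reshape(D, 1)
    # 'canonical': the per-parameter axes (feasible to materialize only for small D).
    canonical = np.eye(D)
    return random_basis, unit_avg_grad, canonical
```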
References
  • Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In EMNLP, 2011.
  • Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey, 2015.
  • Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149, 2018.
  • Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2959–2968, 2019.
  • Rich Caruana. Multitask learning. Machine Learning, 1997.
  • Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv:1812.00420, 2018.
  • Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, 2018.
  • Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 2008.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. CoRR, abs/1812.02224, 2018.
  • Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. arXiv preprint arXiv:1910.07104, 2019.
  • Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.740. URL http://dx.doi.org/10.18653/v1/2020.acl-main.740.
  • Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53 (2):217–288, 2011.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019.
  • Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 590–597, 2019.
  • Amit Kumar Jaiswal, Prayag Tiwari, Sachin Kumar, Deepak Gupta, Ashish Khanna, and Joel JPC Rodrigues. Identifying pneumonia in chest x-rays: A deep learning approach. Measurement, 145: 511–518, 2019.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better? 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. doi: 10.1109/cvpr.2019.00277. URL http://dx.doi.org/10.1109/CVPR.2019.00277.
  • Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7, 2015.
  • Xingyu Lin, Harjatin Baweja, George Kantor, and David Held. Adaptive auxiliary task weighting for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4772–4783, 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.
  • Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150, 2011.
  • Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 43–52, 2015.
  • Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
  • Robert C Moore and William Lewis. Intelligent selection of language model training data. In ACL, 2010.
  • Yuji Nakatsukasa. Accuracy of singular vectors obtained by projection-based svd methods. BIT Numerical Mathematics, 57(4):1137–1152, 2017.
  • Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 2971–2980. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6890-variance-based-regularization-with-convex-objectives.pdf.
  • Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V. Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. CVPR, 2018.
  • Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318, 2013.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
  • Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. In Advances in neural information processing systems, pp. 3347–3357, 2019.
  • Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, Jan 2010. ISSN 1095-7162. doi: 10.1137/080736417. URL http://dx.doi.org/10.1137/080736417.
  • Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.
  • Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv:1706.05098, 2017.
  • Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, 2018.
  • Ayan Sinha, Zhao Chen, Vijay Badrinarayanan, and Andrew Rabinovich. Gradient adversarial training of neural networks. arXiv preprint arXiv:1806.08028, 2018.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019.
  • Mihai Suteu and Yike Guo. Regularizing deep multi-task networks using orthogonal gradients. arXiv preprint arXiv:1912.06844, 2019.
  • Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Revisiting multi-task learning in the deep learning era, 2020.
  • Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, and Zarana Parekh. Learning a multi-domain curriculum for neural machine translation. In ACL, 2020a.
  • Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, and Graham Neubig. Optimizing data usage via differentiable rewards. In ACL, 2020b.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763, 2019.
  • Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European conference on computer vision. Springer, 2014.
Author
Lucio M. Dery