Advisable Learning for Self-Driving Vehicles by Internalizing Observation-to-Action Rules

CVPR, pp. 9658-9667, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.00968

Abstract:

Humans learn to drive through both practice and theory, e.g. by studying the rules, while most self-driving systems are limited to the former. Being able to incorporate human knowledge of typical causal driving behaviour should benefit autonomous systems. We propose a new approach that learns vehicle control with the help of human advice. ...

Introduction
  • Vehicle controllers proposed in the literature use a variety of approaches; recent efforts [5] suggest that deep neural networks can learn vehicle control effectively in an end-to-end manner.
  • These models, however, are known to be opaque.
  • One option is to visualize the controller's attention; however, the resulting attention maps are not always compelling or human-interpretable.
  • Another option is to verbalize the autonomous vehicle's behaviour with natural language [17], Figure 2 (B).
  • The resulting textual explanations are human-understandable, but tend to be rather "shallow", as they report the more common objects over ...
Highlights
  • Autonomous driving control has made dramatic progress in the last several years
  • Vehicle controllers proposed in the literature use a variety of approaches; recent efforts [5] suggest that deep neural networks can learn vehicle control effectively in an end-to-end manner.
  • We evaluate our approach on the Berkeley DeepDrive-eXplanation dataset [17] and show that our model matches or outperforms prior work in control prediction and textual observation generation
  • We report the vehicle control prediction performance for our model and a number of baselines to evaluate the ability to control a vehicle conditioned on the determined actions
  • Towards learning more human-like driving behavior, we propose to use human advice in the form of observation-action rules.
  • We present a new approach where such advice is used as supervision during training and the controls are predicted based on the textual action commands
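The core idea above (advice as training-time supervision, with control conditioned on a textual action command) can be made concrete with a minimal, hypothetical PyTorch-style sketch. It is not the authors' architecture: all module names, layer sizes, and the two-dimensional control output (e.g. acceleration and change of course) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdvisableController(nn.Module):
    """Hypothetical sketch: decode a textual action command and predict
    vehicle controls conditioned on it; the command itself is supervised
    by human-advice text during training."""

    def __init__(self, vocab_size, feat_dim=256, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.action_decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.action_head = nn.Linear(hid_dim, vocab_size)
        # The control head consumes visual features plus a command summary.
        self.control_head = nn.Sequential(
            nn.Linear(feat_dim + hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 2))  # assumed: (acceleration, course change)

    def forward(self, visual_feat, command_tokens):
        emb = self.embed(command_tokens)           # (B, T, H)
        out, _ = self.action_decoder(emb)          # (B, T, H)
        logits = self.action_head(out)             # supervised by advice text
        cmd_vec = out[:, -1]                       # summary of the command
        controls = self.control_head(
            torch.cat([visual_feat, cmd_vec], dim=-1))
        return logits, controls
```

Under this reading, training combines a cross-entropy loss on the command logits against the human advice with a regression loss on the controls, so the observation-to-action rules are internalized rather than required at test time.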
Methods
  • The authors use the Berkeley DeepDrive-eXplanation (BDD-X) dataset [17] to train and evaluate the proposed model (a sketch of a sample's layout follows this list).
  • [Figure: comparison of Seq-to-Seq and Observation-to-Action explanations, e.g. "Because the car in front is stopped" vs. "Because traffic is moving at a steady speed", alongside the corresponding attention maps.]
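For concreteness, one training example in this setup can be pictured as below. This is a hypothetical sketch of a BDD-X-style sample; the field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BDDXSample:
    """Hypothetical layout of one BDD-X-style example: a driving clip
    annotated with control signals and an observation/action text pair,
    e.g. observation "the car in front is stopped" with action
    "the car slows to a stop"."""
    frames: List[str]                    # paths to the clip's video frames
    controls: List[Tuple[float, float]]  # per-frame (accel, course change)
    observation: str                     # textual observation (the "why")
    action: str                          # textual action command (the "what")
```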
Conclusion
  • Towards learning more human-like driving behavior, the authors propose to use human advice in the form of observation-action rules.
  • The authors present a new approach where such advice is used as supervision during training and the controls are predicted based on the textual action commands.
  • The authors rely on a semantic visual representation to better ground the textual observations and generate object-centric attention maps (see the attention sketch after this list).
  • The authors' experiments on the BDD-X dataset show that the model matches or outperforms prior work in control prediction and textual observation generation.
  • The authors' human evaluation on the CARLA simulator further shows that the advisable system can increase user trust.
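The object-centric attention mentioned above can be illustrated with a small, hypothetical sketch: given per-object features (e.g. pooled from an off-the-shelf detector or segmenter) and a query vector, attention weights fall on objects rather than raw pixels. The function below is an assumption-level illustration, not the authors' implementation.

```python
import torch

def object_attention(obj_feats: torch.Tensor, query: torch.Tensor):
    """Dot-product attention over per-object features.

    obj_feats: (num_objects, dim) features, one row per detected object.
    query:     (dim,) vector, e.g. the current textual-observation state.
    Returns one weight per object (an object-level attention map) and
    the attended feature summary.
    """
    scores = obj_feats @ query              # relevance of each object
    weights = torch.softmax(scores, dim=0)  # normalized per-object weights
    context = weights @ obj_feats           # weighted feature summary
    return weights, context
```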
Tables
  • Table1: We report the vehicle control prediction performance for our approach and existing baselines. We compare the performance in terms of the median of average displacement errors (ADEs) as well as the 1st (Q1) and 3rd (Q3) quartiles (lower is better), i.e. Median [Q1, Q3] (a short computation sketch follows this list)
  • Table2: We report the quality of the generated textual observations (top) and action commands (bottom). We rely on standard automatic metrics: BLEU-4 [27], METEOR [20], CIDEr-D [34], and SPICE [1]. †: reported by [17]
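As a concrete reading of the Table 1 convention, the Median [Q1, Q3] summary of ADE can be computed as in the short sketch below. It assumes per-timestep absolute control errors averaged within each clip, which is an illustrative simplification of the metric.

```python
import numpy as np

def ade_summary(pred: np.ndarray, gt: np.ndarray) -> str:
    """Median [Q1, Q3] of average displacement error (lower is better).

    pred, gt: (num_clips, timesteps) predicted and ground-truth control
    values; each clip contributes one averaged error.
    """
    ade = np.abs(pred - gt).mean(axis=1)            # per-clip average error
    q1, med, q3 = np.percentile(ade, [25, 50, 75])  # quartiles and median
    return f"{med:.2f} [{q1:.2f}, {q3:.2f}]"

# Example: 100 clips of 20 timesteps each.
rng = np.random.default_rng(0)
print(ade_summary(rng.normal(size=(100, 20)), np.zeros((100, 20))))
```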
Related work
  • End-to-End Learning for Self-driving Vehicles. Recent works [4, 12] suggest that a driving policy can be successfully learned by neural networks through supervised learning over observation (e.g. video) and action (e.g. steering) pairs collected from human demonstrations. Bojarski et al. [5] trained a 5-layer ConvNet to predict steering controls from a dashcam image, while Xu et al. [39] utilized a dilated ConvNet combined with an LSTM to predict the vehicle's discretized future motions. Recently, Hecker et al. [12] explored an extended model that takes a surround-view multi-camera system, a route planner, and a CAN bus reader. Codevilla et al. [7] explored a conditional end-to-end driving model that takes a high-level command input (i.e. left-/right-turn, lane following, and intersection passing) at test time, see Figure 2 (A); a minimal sketch of this branching scheme follows this paragraph. To reduce complexity, there is growing interest in end-to-mid [41] and mid-to-mid [4] driving models that produce a mid-level output representation in the form of a drivable trajectory by consuming either raw sensor data or an intermediate scene representation as input. Their behavior, however, is opaque, and learning to drive in urban areas remains challenging. These driving models are also known to be "black boxes", and this lack of transparency is a major drawback in self-driving applications, where a high level of user trust is required to accept such a radical technology.
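The command-conditional scheme of Codevilla et al. [7] is often realized with per-command branches. The sketch below is a minimal, assumption-level rendering of that idea (layer sizes and the (steering, throttle) output are illustrative), not a reproduction of their network.

```python
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    """Sketch of command-conditional imitation learning in the spirit of
    Codevilla et al. [7]: one control branch per high-level command
    (e.g. left turn, right turn, lane follow), selected at test time."""

    def __init__(self, feat_dim: int = 512, num_commands: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                          nn.Linear(128, 2))   # (steering, throttle)
            for _ in range(num_commands))

    def forward(self, features: torch.Tensor, command_id: int):
        # Route shared perception features through the branch that
        # matches the given high-level command.
        return self.branches[command_id](features)
```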
Funding
  • This work was supported by DARPA XAI program and Berkeley DeepDrive
  • Kim was in part supported by Samsung Scholarship
Reference
  • [1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, pages 382–398. Springer, 2016.
  • [2] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 2013.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2014.
  • [4] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. RSS, 2019.
  • [5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. CoRR abs/1604.07316, 2016.
  • [6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2018.
  • [7] Felipe Codevilla, Matthias Müller, Antonio Lopez, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In ICRA, pages 1–9. IEEE, 2018.
  • [8] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. CoRL, 2017.
  • [9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • [10] David Gunning. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), 2017.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
  • [12] Simon Hecker, Dengxin Dai, and Luc Van Gool. End-to-end learning of driving models with surround-view cameras and route planners. In ECCV, 2018.
  • [13] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In ECCV, 2016.
  • [14] Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. Grounding visual explanations. In ECCV, 2018.
  • [15] Jinkyu Kim and John Canny. Interpretable learning for self-driving cars by visualizing causal attention. ICCV, 2017.
  • [16] Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, and John Canny. Grounding human-to-vehicle advice for self-driving vehicles. CVPR, 2019.
  • [17] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In ECCV, 2018.
  • [18] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • [19] Gregory Kuhlmann, Peter Stone, Raymond Mooney, and Jude Shavlik. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In AAAI Workshop, 2004.
  • [20] Alon Lavie and Abhaya Agarwal. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In EMNLP, 2005.
  • [21] Jiwei Li, Alexander H Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823, 2016.
  • [22] Huan Ling and Sanja Fidler. Teaching machines to describe images via natural language feedback. arXiv preprint arXiv:1706.00130, 2017.
  • [23] John McCarthy. Programs with common sense. RLE and MIT Computation Center, 1960.
  • [24] Dipendra K Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. IJRR, 2016.
  • [25] Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. Environment-driven lexicon induction for high-level instructions. In ACL, 2015.
  • [26] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
  • [27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • [28] Junha Roh, Chris Paxton, Andrzej Pronobis, Ali Farhadi, and Dieter Fox. Conditional driving from natural language instructions. CoRL, 2019.
  • [29] Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated BatchNorm for memory-optimized training of DNNs. In CVPR, 2018.
  • [30] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [31] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth J Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI, 2011.
  • [32] Hsiao-Yu Fish Tung, Adam W Harley, Liang-Kang Huang, and Katerina Fragkiadaki. Reward learning from narrated demonstrations. CVPR, 2018.
  • [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [34] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In ICCV, 2015.
  • [35] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence - video to text. In ICCV, pages 4534–4542, 2015.
  • [36] Dequan Wang, Coline Devin, Qi-Zhi Cai, Fisher Yu, and Trevor Darrell. Deep object-centric policies for autonomous driving. ICRA, 2019.
  • [37] Jason E Weston. Dialog-based language learning. In NeurIPS, 2016.
  • [38] Jialin Wu and Raymond J Mooney. Faithful multimodal explanation for visual question answering. arXiv preprint arXiv:1809.02805, 2018.
  • [39] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-scale video datasets. In CVPR, 2017.
  • [40] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
  • [41] Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, and Raquel Urtasun. End-to-end interpretable neural motion planner. In CVPR, pages 8660–8669, 2019.
  • [42] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.