VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

CVPR, pp. 11522-11530, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01154

Abstract:

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. A novel auxiliary task is also proposed to recover randomly masked out map entities and agent trajectories based on their context. VectorNet is evaluated on an in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset, where it achieves on-par or better performance than a competitive rendering baseline, with a 70% saving in model size and an order of magnitude reduction in FLOPs.

Introduction
  • This paper focuses on behavior prediction in complex multi-agent systems, such as self-driving vehicles.
  • The core interest is to find a unified representation which integrates the agent dynamics, acquired by perception systems such as object detection and tracking, with the scene context, provided as prior knowledge often in the form of High Definition (HD) maps (a minimal code sketch of this vectorization follows this list).
  • [Figure: the vectorized representation illustrated on a crosswalk, lane boundaries, and an agent trajectory.]
  • The authors' goal is to build a system which learns to predict the intents of vehicles, which are parameterized as trajectories
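The vectorization step can be pictured with a short, hypothetical Python sketch (not the authors' code): each map polyline or agent trajectory is broken into consecutive vectors that keep their start point, end point, attribute features and a polyline identifier. The feature layout and attribute names below are illustrative assumptions.

```python
# Minimal sketch: turning a map polyline or an agent trajectory into a set of
# vectors. Each vector stores its start point, end point, attribute features,
# and the id of the polyline it belongs to. Feature names are illustrative.
import numpy as np

def polyline_to_vectors(points, attributes, polyline_id):
    """points: (N, 2) array of consecutive 2D points (lane samples or
    trajectory waypoints); attributes: (N-1, F) per-segment features such as
    object type or timestamp (assumed layout)."""
    starts = points[:-1]          # (N-1, 2) start coordinates
    ends = points[1:]             # (N-1, 2) end coordinates
    ids = np.full((len(points) - 1, 1), polyline_id, dtype=np.float32)
    return np.concatenate([starts, ends, attributes, ids], axis=1)

# Example: a straight lane centerline with a one-hot "lane" attribute.
lane_pts = np.stack([np.linspace(0, 10, 6), np.zeros(6)], axis=1)
lane_attr = np.tile(np.array([[1.0, 0.0]]), (5, 1))   # [is_lane, is_agent]
vectors = polyline_to_vectors(lane_pts, lane_attr, polyline_id=0)
print(vectors.shape)  # (5, 7): start_xy, end_xy, 2 attrs, polyline id
```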
Highlights
  • This paper focuses on behavior prediction in complex multi-agent systems, such as self-driving vehicles
  • We propose the hierarchical graph network VectorNet and the node completion auxiliary task (a sketch of the two-level encoder follows this list)
  • We evaluate the proposed method on our in-house behavior prediction dataset and the Argoverse dataset, and show that it achieves on-par or better performance than a competitive rendering baseline, with a 70% saving in model size and an order of magnitude reduction in FLOPs
  • We proposed to represent the High Definition map and agent dynamics with a vectorized representation
  • Experiments on the large-scale in-house dataset and the publicly available Argoverse dataset show that the proposed VectorNet outperforms the ConvNet counterpart while at the same time reducing the computational cost by a large margin
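The hierarchical design named in the highlights above can be pictured as a polyline subgraph that pools per-vector features into one node per polyline, followed by a global interaction graph realized with self-attention. The PyTorch snippet below is a minimal sketch of that two-level idea only; the layer widths, the use of nn.MultiheadAttention and the pooling choices are assumptions rather than the paper's exact architecture.

```python
# Sketch: polyline subgraph (shared MLP + permutation-invariant max pooling)
# followed by a global self-attention layer over the polyline nodes.
import torch
import torch.nn as nn

class PolylineSubgraphLayer(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim),
                                 nn.LayerNorm(hidden_dim), nn.ReLU())

    def forward(self, x):                 # x: (num_polylines, num_vectors, in_dim)
        h = self.mlp(x)
        pooled = h.max(dim=1, keepdim=True).values       # pool over vectors
        return torch.cat([h, pooled.expand_as(h)], dim=-1)  # concat local + pooled

class VectorNetSketch(nn.Module):
    def __init__(self, vec_dim=7, hidden=64, heads=4):
        super().__init__()
        self.sub1 = PolylineSubgraphLayer(vec_dim, hidden)
        self.sub2 = PolylineSubgraphLayer(2 * hidden, hidden)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, vectors):           # (num_polylines, num_vectors, vec_dim)
        h = self.sub2(self.sub1(vectors))
        nodes = h.max(dim=1).values.unsqueeze(0)   # one node per polyline
        out, _ = self.attn(nodes, nodes, nodes)    # global interaction graph
        return out.squeeze(0)                      # one feature per polyline

poly_feats = VectorNetSketch()(torch.randn(8, 5, 7))
print(poly_feats.shape)  # torch.Size([8, 128])
```

Max pooling is used in the sketch because it is permutation invariant over the vectors of a polyline, matching the set-encoding discussion in the related work.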
Methods
  • The authors first describe the experimental settings, including the datasets, metrics and the rasterized + ConvNets baseline.
  • Comprehensive ablation studies are then done for both the rasterized baseline and VectorNet, followed by a comparison and discussion of the computational cost, including FLOPs and the number of parameters.
  • Finally, the authors compare the performance with state-of-the-art methods
Conclusion
  • The authors proposed to represent the HD map and agent dynamics with a vectorized representation.
  • Experiments on the large-scale in-house dataset and the publicly available Argoverse dataset show that the proposed VectorNet outperforms the ConvNet counterpart while at the same time reducing the computational cost by a large margin.
  • A natural step is to incorporate the VectorNet encoder with a multi-modal trajectory decoder (e.g. [6, 29]) to generate diverse future trajectories (a hedged sketch of such a decoder head follows)
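One simple way to picture the multi-modal decoder mentioned as future work is a small head that regresses K candidate trajectories plus a confidence per mode from the target agent's encoded feature. This is only a hedged sketch under assumed shapes; it is not the decoder of [6] or [29], and the number of modes, horizon and MLP widths are illustrative.

```python
# Sketch: a multi-modal decoder head producing K trajectories and confidences.
import torch
import torch.nn as nn

class MultiModalDecoder(nn.Module):
    def __init__(self, feat_dim=128, num_modes=6, horizon=30):
        super().__init__()
        self.num_modes, self.horizon = num_modes, horizon
        self.traj_head = nn.Linear(feat_dim, num_modes * horizon * 2)
        self.conf_head = nn.Linear(feat_dim, num_modes)

    def forward(self, target_feat):       # (batch, feat_dim) target-agent feature
        trajs = self.traj_head(target_feat).view(-1, self.num_modes, self.horizon, 2)
        confs = self.conf_head(target_feat).softmax(dim=-1)
        return trajs, confs                # K trajectories and their probabilities

trajs, confs = MultiModalDecoder()(torch.randn(2, 128))
print(trajs.shape, confs.shape)  # (2, 6, 30, 2) (2, 6)
```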
Tables
  • Table1: Impact of receptive field (as controlled by convolutional kernel size and crop strategy) and rendering resolution for the ConvNet baseline. We report DE and ADE (in meters) on both the in-house dataset and the Argoverse dataset
  • Table2: Ablation studies for VectorNet with different input node types and training objectives. Here “map” refers to the input vectors from the HD maps, and “agents” refers to the input vectors from the trajectories of non-target vehicles. When “Node Compl.” is enabled, the model is trained with the graph completion objective in addition to trajectory prediction. DE and ADE are reported in meters
  • Table3: Ablation on the depth and width of the polyline subgraph and the global graph. The depth of the polyline subgraph has the biggest impact on DE@3s
  • Table4: Comparison of model FLOPs and number of parameters for VectorNet and the ConvNet baselines
  • Table5: Trajectory prediction performance on the Argoverse Forecasting test set when the number of sampled trajectories is K=1. Results were retrieved from the Argoverse leaderboard [1] on 03/18/2020 (a sketch of how the DE and ADE metrics reported above are computed follows this list)
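For reference, the displacement metrics reported in the tables are commonly computed as below: ADE averages the L2 error over all predicted steps, and DE@t takes the error at a single horizon (e.g. 3 s). This sketch assumes those standard definitions and positions given in meters; the 3 s index is assumed to be the last predicted step here.

```python
# Sketch: average displacement error (ADE) and displacement error at a horizon.
import numpy as np

def ade(pred, gt):
    """pred, gt: (T, 2) arrays of predicted / ground-truth positions."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def de_at(pred, gt, step):
    """Displacement error at a single time step (e.g. the 3 s index)."""
    return float(np.linalg.norm(pred[step] - gt[step]))

gt = np.stack([np.linspace(0, 30, 30), np.zeros(30)], axis=1)   # 30 future steps
pred = gt + np.random.normal(scale=0.5, size=gt.shape)
print(ade(pred, gt), de_at(pred, gt, step=-1))
```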
Related work
  • Behavior prediction for autonomous driving. Behavior prediction for moving agents has become increasingly important for autonomous driving applications [7, 9, 19], and high-fidelity maps have been widely used to provide context information. For example, IntentNet [5] proposes to jointly detect vehicles and predict their trajectories from LiDAR points and rendered HD maps. Hong et al. [15] assume that vehicle detections are provided and focus on behavior prediction by encoding entity interactions with ConvNets. Similarly, MultiPath [6] also uses ConvNets as the encoder, but adopts pre-defined trajectory anchors to regress multiple possible future trajectories. PRECOG [23] attempts to capture future stochasticity with flow-based generative models. Similar to [6, 15, 23], we also assume that agent detections are provided by an existing perception algorithm. However, unlike these methods, which all use ConvNets to encode rendered road maps, we propose to directly encode vectorized scene context and agent dynamics.
  • Forecasting multi-agent interactions. Beyond the autonomous driving domain, there is broader interest in predicting the intents of interacting agents, such as pedestrians [2, 13, 24], human activities [28] or sports players [12, 26, 32, 33]. In particular, Social LSTM [2] models the trajectories of individual agents with separate LSTM networks, and aggregates the LSTM hidden states based on the spatial proximity of the agents to model their interactions. Social GAN [13] simplifies the interaction module and proposes an adversarial discriminator to predict diverse futures. Sun et al. [26] combine graph networks [4] with variational RNNs [8] to model diverse interactions. Social interactions can also be inferred from data: Kipf et al. [18] treat such interactions as latent variables, and graph attention networks [16, 31] apply a self-attention mechanism to weight the edges of a pre-defined graph. Our method goes one step further by proposing a unified hierarchical graph network to jointly model the interactions of multiple agents and their interactions with the entities from road maps.
  • Representation learning for sets of entities. Traditionally, machine perception algorithms have focused on high-dimensional continuous signals such as images, videos or audio. One exception is 3D perception, where the inputs are usually unordered point sets given by depth sensors. For example, Qi et al. propose the PointNet model [20] and PointNet++ [21], which apply permutation-invariant operations (e.g. max pooling) on learned point embeddings. Unlike point sets, entities on HD maps and agent trajectories form closed shapes or are directed, and they may also carry attribute information. We therefore propose to keep such information by vectorizing the inputs and encoding the attributes as node features in a graph.
  • Self-supervised context modeling. Recently, many works in the NLP domain have proposed modeling language context in a self-supervised fashion [11, 22]. Their learned representations achieve significant performance improvements when transferred to downstream tasks. Inspired by these methods, we propose an auxiliary loss for graph representations that learns to predict the missing node features from their neighbors, which incentivizes the model to better capture interactions among nodes (a hedged code sketch of this masked-node objective follows).
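The auxiliary objective described in the last paragraph can be sketched as masking the features of randomly selected polyline nodes and training the model to reconstruct them from the remaining context. The masking ratio, the zeroing-out corruption, the L2 reconstruction loss and the toy encoder/decoder modules below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: masked node-feature completion as an auxiliary self-supervised loss.
import torch
import torch.nn as nn

def node_completion_loss(node_feats, encoder, decoder, mask_ratio=0.15):
    """node_feats: (num_nodes, feat_dim) polyline node features before the
    global graph; encoder: global interaction module; decoder: small MLP that
    predicts the original feature of each masked node."""
    num_nodes = node_feats.size(0)
    mask = torch.rand(num_nodes) < mask_ratio
    corrupted = node_feats.clone()
    corrupted[mask] = 0.0                      # hide the selected node features
    context = encoder(corrupted.unsqueeze(0)).squeeze(0)
    if mask.any():
        recon = decoder(context[mask])
        return nn.functional.mse_loss(recon, node_feats[mask])
    return node_feats.sum() * 0.0              # no node masked in this draw

# Example wiring with toy modules standing in for the global graph and decoder.
encoder = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
decoder = nn.Linear(128, 128)
loss = node_completion_loss(torch.randn(16, 128), encoder, decoder)
```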
Funding
  • Introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and models the high-order interactions among all components
  • Proposes a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context
  • Evaluates VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset
  • Focuses on behavior prediction in complex multi-agent systems, such as self-driving vehicles
Reference
  • [1] Argoverse Motion Forecasting Competition, 2019. https://evalai.cloudcv.org/web/challenges/challenge-page/454/leaderboard/1279.
  • [2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
  • [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [4] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [5] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to predict intention from raw sensor data. In CoRL, 2018.
  • [6] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL, 2019.
  • [7] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3D tracking and forecasting with rich maps. In CVPR, 2019.
  • [8] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
  • [9] James Colyar and John Halkias. US Highway 101 dataset. FHWA-HRT-07-030, 2007.
  • [10] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou, Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schneider, and Nemanja Djuric. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, 2019.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [12] Panna Felsen, Pulkit Agrawal, and Jitendra Malik. What will happen next? Forecasting player moves in sports videos. In ICCV, 2017.
  • [13] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [15] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
  • [16] Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. arXiv preprint arXiv:1706.06122, 2017.
  • [17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In ICML, 2018.
  • [19] Robert Krajewski, Julian Bock, Laurent Kloeker, and Lutz Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In ITSC, 2018.
  • [20] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  • [21] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
  • [22] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [23] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. PRECOG: Prediction conditioned on goals in visual multi-agent settings. In ICCV, 2019.
  • [24] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
  • [25] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
  • [26] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. In ICLR, 2019.
  • [27] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.
  • [28] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In CVPR, 2019.
  • [29] Charlie Tang and Russ R. Salakhutdinov. Multiple futures prediction. In NeurIPS, 2019.
  • [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [31] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [32] Raymond A. Yeh, Alexander G. Schwing, Jonathan Huang, and Kevin Murphy. Diverse generation for multi-agent sports games. In CVPR, 2019.
  • [33] Eric Zhan, Stephan Zheng, Yisong Yue, Long Sha, and Patrick Lucey. Generative multi-agent behavioral cloning. arXiv preprint arXiv:1803.07612, 2018.