The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction

CVPR, pp. 10505-10515, 2020.

DOI: https://doi.org/10.1109/CVPR42600.2020.01052
Our study is the first to provide a quantitative benchmark and evaluation methodology for multi-future trajectory prediction by using human annotators to create a variety of trajectory continuations under the identical past

Abstract:

This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. We make two main contributions. The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real-world trajectory data, and then extrapolated by human...

Introduction
  • Future path prediction, which aims at forecasting a pedestrian’s trajectory over the next few seconds, has received a lot of attention in the community [20, 1, 15, 26].
  • Since the ground truth data contains only one trajectory for each observed past, it is difficult to evaluate probabilistic models that predict multiple possible futures.
Highlights
  • Forecasting future human behavior is a fundamental problem in video understanding
  • This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes
  • We show that our model achieves the best results on our dataset, as well as on the real-world VIRAT/ActEV dataset.
  • We have introduced the Forking Paths dataset, and the Multiverse model for multi-future forecasting
  • Our study is the first to provide a quantitative benchmark and evaluation methodology for multi-future trajectory prediction by using human annotators to create a variety of trajectory continuations under the identical past
  • We have shown that our method achieves state-of-the-art performance on two challenging benchmarks: the large-scale real video dataset and our proposed multi-future trajectory dataset
Methods
  • The authors describe the model for forecasting agent trajectories, which the authors call Multiverse.
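  • Since there is inherent uncertainty in this task, the goal is to design a model that can effectively predict multiple plausible future trajectories by computing the multimodal distribution $p(L_{h+1:T} \mid L_{1:h}, V_{1:h})$ over future locations, given the past locations $L_{1:h}$ and video frames $V_{1:h}$.
  • A coarse location decoder predicts a distribution over grid cells, and the fine location decoder outputs a vector offset within each grid cell.
  • The two outputs are combined to generate a multimodal distribution over $\mathbb{R}^2$ for predicted locations.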
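As a rough illustration of how a per-cell probability and a per-cell offset can be combined into candidate locations, here is a minimal sketch. It is not the authors' implementation: the function name `combine_coarse_fine`, the convention that offsets are measured in pixels from each cell centre, and the grid sizes are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def combine_coarse_fine(
    cell_probs: List[List[float]],
    offsets: List[List[Point]],
    cell_w: float,
    cell_h: float,
    top_k: int = 3,
) -> List[Tuple[float, float, float]]:
    """Return the top_k candidate locations as (probability, x, y) tuples.

    cell_probs[r][c] is the (hypothetical) coarse probability of grid cell (r, c).
    offsets[r][c] is the (hypothetical) fine offset (dx, dy) from that cell's centre.
    """
    candidates = []
    for row, prob_row in enumerate(cell_probs):
        for col, prob in enumerate(prob_row):
            dx, dy = offsets[row][col]
            # Cell centre in pixel coordinates (assumed convention).
            centre_x = (col + 0.5) * cell_w
            centre_y = (row + 0.5) * cell_h
            candidates.append((prob, centre_x + dx, centre_y + dy))
    # Most probable cells first; each one contributes one mode.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_k]


# Toy 2x2 grid with 100x100-pixel cells.
probs = [[0.1, 0.6], [0.25, 0.05]]
offs = [[(0.0, 0.0), (10.0, -5.0)], [(-20.0, 15.0), (0.0, 0.0)]]
modes = combine_coarse_fine(probs, offs, cell_w=100.0, cell_h=100.0, top_k=2)
```

In this toy setup the two returned modes sit in different cells, which is how a grid-based output can represent several distinct futures at once while the offsets recover sub-cell precision.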
Results
  • The authors evaluate various methods, including the Multiverse model, for multi-future trajectory prediction on the proposed Forking Paths dataset.
  • The authors' model outperforms baselines in all metrics and it performs significantly better on the minADE metric, which suggests better prediction quality over all time instants.
  • The authors measure the standard negative log-likelihood (NLL) metric for the top methods in Table 2.
  • The predicted trajectories are shown in yellow-orange heatmaps for multi-future prediction methods, and in red lines for
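The displacement metrics mentioned above can be sketched as follows. This is one common reading of minADE and minFDE for multi-future evaluation (for each ground-truth future, take the closest of the K predictions, then average over futures); the function names and the toy trajectories are assumptions for illustration, and the paper's exact protocol may differ in details.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def ade(pred: Trajectory, truth: Trajectory) -> float:
    """Average displacement error: mean Euclidean distance over all time steps."""
    dists = [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / len(dists)


def fde(pred: Trajectory, truth: Trajectory) -> float:
    """Final displacement error: Euclidean distance at the last time step."""
    (px, py), (tx, ty) = pred[-1], truth[-1]
    return math.hypot(px - tx, py - ty)


def min_ade_fde(preds: List[Trajectory], futures: List[Trajectory]) -> Tuple[float, float]:
    """For each ground-truth future, keep the best of the K predictions, then average."""
    min_ades = [min(ade(p, f) for p in preds) for f in futures]
    min_fdes = [min(fde(p, f) for p in preds) for f in futures]
    return sum(min_ades) / len(futures), sum(min_fdes) / len(futures)


# Two predictions (go right, go up) scored against two annotated futures.
preds = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (0, 1), (0, 2)]]
futures = [[(0, 0), (1, 0), (2, 0)], [(0, 0), (0, 1), (0, 3)]]
min_ade, min_fde = min_ade_fde(preds, futures)
```

Because ADE averages over every time step while FDE looks only at the endpoint, a gain on minADE reflects the whole predicted path rather than just where it ends.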
Conclusion
  • The authors have introduced the Forking Paths dataset, and the Multiverse model for multi-future forecasting.
  • The dataset, together with the models, is intended to facilitate future research and applications on multi-future prediction.
  • The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, NIST, DOI/IBC, the National Science Foundation, Baidu, or the U.S. Government.
Tables
  • Table 1: Comparison of different methods on the Forking Paths dataset. Lower numbers are better. The numbers for the column labeled “45 degrees” are averaged over 3 different 45-degree views. For the input types, “Traj.”, “RGB”, “Seg.” and “Bbox.” mean the inputs are xy coordinates, raw frames, semantic segmentations and bounding boxes of all objects in the scene, respectively. All models are trained on real VIRAT/ActEV videos and tested on synthetic (CARLA-rendered) videos.
  • Table 2: Negative log-likelihood comparison of different methods on the Forking Paths dataset. For methods that output multiple trajectories, we quantize the xy-coordinates into the same grid as our method and get a normalized probability distribution prediction.
  • Table 3: Comparison of different methods on the VIRAT/ActEV dataset. We report ADE/FDE metrics. The first column is for models trained on the real video training set and the second column is for models trained on the simulated version of this dataset.
  • Table 4: Performance of ablated versions of our model on single- and multi-future trajectory prediction. Lower numbers are better.
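The quantization step described in the Table 2 caption can be illustrated with a small sketch for a single time step. It is an assumption-laden example rather than the authors' evaluation code: the helper names, the grid layout, and the additive smoothing term `eps` (added here only to avoid taking the logarithm of zero) are all hypothetical, and points are assumed to lie inside the grid.

```python
import math
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]


def cell_of(x: float, y: float, cell_w: float, cell_h: float) -> Tuple[int, int]:
    """Map a pixel coordinate to its (row, col) grid cell."""
    return (int(y // cell_h), int(x // cell_w))


def quantized_nll(
    pred_points: List[Point],
    true_point: Point,
    grid_rows: int,
    grid_cols: int,
    cell_w: float,
    cell_h: float,
    eps: float = 1e-6,
) -> float:
    """NLL of the true cell under a histogram built from K predicted points."""
    counts = Counter(cell_of(x, y, cell_w, cell_h) for x, y in pred_points)
    n_cells = grid_rows * grid_cols
    # Smoothed, normalized distribution over all grid cells (sums to 1).
    total = len(pred_points) + eps * n_cells
    true_cell = cell_of(true_point[0], true_point[1], cell_w, cell_h)
    prob = (counts.get(true_cell, 0) + eps) / total
    return -math.log(prob)


# Four sampled predictions on a 2x2 grid of 100x100-pixel cells:
# three fall in cell (0, 1), one in cell (1, 0).
samples = [(150.0, 20.0), (160.0, 40.0), (120.0, 90.0), (30.0, 130.0)]
nll = quantized_nll(samples, true_point=(170.0, 60.0),
                    grid_rows=2, grid_cols=2, cell_w=100.0, cell_h=100.0)
```

Putting sample-based methods on the same grid as a model that predicts cell probabilities directly is what makes their NLL values comparable, at the cost of some sensitivity to the grid resolution and the smoothing choice.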
Related work
  • Single-future trajectory prediction. Recent works have tried to predict a single best trajectory for pedestrians or vehicles. Early works [35, 59, 62] focused on modeling person motions by considering them as points in the scene. Other works [21, 60, 33, 30] have attempted to predict person paths by utilizing visual features. Recently, Liang et al. [30] proposed a joint future activity and trajectory prediction framework that utilized multiple visual features using focal attention [29, 28]. Many works [23, 50, 4, 18, 64] on vehicle trajectory prediction have been proposed. CAR-Net [50] proposed attention networks on top of a scene semantic CNN to predict vehicle trajectories. ChauffeurNet [4] utilized imitation learning for trajectory prediction.
  • Multi-future trajectory prediction. Many works have tried to model the uncertainty of trajectory prediction. Various papers (e.g. [20, 43, 44]) use Inverse Reinforcement Learning (IRL) to forecast human trajectories. Social-LSTM [1] is a popular method using social pooling to predict future trajectories. Other works [49, 15, 26, 2], like Social-GAN [15], have utilized generative adversarial networks [14] to generate diverse person trajectories. In vehicle trajectory prediction, DESIRE [23] utilized variational auto-encoders (VAE) to predict future vehicle trajectories. Many recent works [54, 6, 53, 34] also proposed probabilistic frameworks for multi-future vehicle trajectory prediction. Different from these previous works, we present a flexible two-stage framework that combines multi-modal distribution modeling and precise location prediction.
  • Trajectory datasets. Many vehicle trajectory datasets [5, 7] have been proposed as a result of self-driving’s surging popularity. With the recent advancement in 3D computer vision research [63, 27, 51, 11, 45, 47, 16], many research works [39, 12, 10, 9, 57, 66, 52] have looked into 3D simulated environments for their flexibility and ability to generate enormous amounts of data. We are the first to propose a 3D simulation dataset that is reconstructed from real-world scenarios complemented with a variety of human trajectory continuations for multi-future person trajectory prediction.

    Footnote 2: https://en.wikipedia.org/wiki/The_Garden_of_Forking_Paths
Funding
  • This research was supported by NSF grant IIS-1650994, the financial assistance award 60NANB17D156 from NIST, and a Baidu Scholarship.
  • This work was also supported by IARPA via DOI/IBC contract number D17PC00340.
Reference
  • [1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In CVPR, 2016.
  • [2] Javad Amirian, Jean-Bernard Hayet, and Julien Pettre. Social ways: Learning multi-modal distributions of pedestrian trajectories with gans. In CVPRW, 2019.
  • [3] George Awad, Asad Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, Georges Quénot, Joao Magalhaes, David Semedo, and Saverio Blasi. Trecvid 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In TRECVID, 2018.
  • [4] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
  • [5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
  • [6] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019.
  • [7] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
  • [8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [9] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In CVPRW, 2018.
  • [10] Cesar Roberto de Souza, Adrien Gaidon, Yohann Cabon, and Antonio Manuel Lopez. Procedural generation of videos to train deep action recognition networks. In CVPR, 2017.
  • [11] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. arXiv preprint arXiv:1711.03938, 2017.
  • [12] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
  • [13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [15] Agrim Gupta, Justin Johnson, Silvio Savarese, Li Fei-Fei, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In CVPR, 2018.
  • [16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, SM Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
  • [17] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [18] Joey Hong, Benjamin Sapp, and James Philbin. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In CVPR, 2019.
  • [19] RE Kalman. A new approach to linear filtering and prediction problems. Trans. ASME, D, 82:35–44, 1960.
  • [20] Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In ECCV, 2012.
  • [21] Julian Francisco Pieter Kooij, Nicolas Schneider, Fabian Flohr, and Dariu M Gavrila. Context-based pedestrian path prediction. In ECCV, 2014.
  • [22] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
  • [23] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, 2017.
  • [24] Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. Crowds by example. In Computer Graphics Forum, pages 655–664. Wiley Online Library, 2007.
  • [25] Jiwei Li, Will Monroe, and Dan Jurafsky. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016.
  • [26] Yuke Li. Which way are you going? imitative decision learning for path forecasting in dynamic scenes. In CVPR, 2019.
  • [27] Junwei Liang, Desai Fan, Han Lu, Poyao Huang, Jia Chen, Lu Jiang, and Alexander Hauptmann. An event reconstruction tool for conflict monitoring using social media. In AAAI, 2017.
  • [28] Junwei Liang, Lu Jiang, Liangliang Cao, Yannis Kalantidis, Li-Jia Li, and Alexander G Hauptmann. Focal visual-text attention for memex question answering. IEEE transactions on pattern analysis and machine intelligence, 41(8):1893–1908, 2019.
  • [29] Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, and Alexander G Hauptmann. Focal visual-text attention for visual question answering. In CVPR, 2018.
  • [30] Junwei Liang, Lu Jiang, Juan Carlos Niebles, Alexander G Hauptmann, and Li Fei-Fei. Peeking into the future: Predicting future person activities and locations in videos. In CVPR, 2019.
  • [31] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [32] Matthias Luber, Johannes A Stork, Gian Diego Tipaldi, and Kai O Arras. People tracking with human motion predictions from social forces. In ICRA, 2010.
  • [33] Wei-Chiu Ma, De-An Huang, Namhoon Lee, and Kris M Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, 2017.
  • [34] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In CVPR, 2019.
  • [35] Huynh Manh and Gita Alaghband. Scene-lstm: A model for human trajectory prediction. arXiv preprint arXiv:1808.04018, 2018.
  • [36] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, 2011.
  • [37] Stefano Pellegrini, Andreas Ess, and Luc Van Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, 2012.
  • [38] Tobias Plotz and Stefan Roth. Neural nearest neighbors networks. In NeurIPS, 2018.
  • [39] Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. Unrealcv: Virtual worlds for computer vision. In ACM Multimedia, 2017.
  • [40] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • [41] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [42] Nicholas Rhinehart and Kris M Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017.
  • [43] Nicholas Rhinehart, Kris M Kitani, and Paul Vernaza. R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In ECCV, 2018.
  • [44] Nicholas Rhinehart, Rowan McAllister, Kris Kitani, and Sergey Levine. Precog: Prediction conditioned on goals in visual multi-agent settings. arXiv preprint arXiv:1905.01296, 2019.
  • [45] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
  • [46] Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV, 2016.
  • [47] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [48] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV, 2017.
  • [49] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, and Silvio Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482, 2018.
  • [50] Amir Sadeghian, Ferdinand Legros, Maxime Voisin, Ricky Vesel, Alexandre Alahi, and Silvio Savarese. Car-net: Clairvoyant attentive recurrent network. In ECCV, 2018.
  • [51] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics, pages 621–635. Springer, 2018.
  • [52] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641, 2019.
  • [53] Yichuan Charlie Tang and Ruslan Salakhutdinov. Multiple futures prediction. arXiv preprint arXiv:1911.00997, 2019.
  • [54] Luca Anthony Thiede and Pratik Prabhanjan Brahma. Analyzing the variety loss in the context of probabilistic trajectory prediction. arXiv preprint arXiv:1907.10178, 2019.
  • [55] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [56] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
  • [57] Yu Wu, Lu Jiang, and Yi Yang. Revisiting embodiedqa: A simple baseline and beyond. arXiv preprint arXiv:1904.04166, 2019.
  • [58] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NeurIPS, 2015.
  • [59] Hao Xue, Du Q Huynh, and Mark Reynolds. Sslstm: A hierarchical lstm model for pedestrian trajectory prediction. In WACV, 2018.
  • [60] Takuma Yagi, Karttikeya Mangalam, Ryo Yonetani, and Yoichi Sato. Future person localization in first-person videos. In CVPR, 2018.
  • [61] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • [62] Pu Zhang, Wanli Ouyang, Pengfei Zhang, Jianru Xue, and Nanning Zheng. Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In CVPR, 2019.
  • [63] Yiwei Zhang, Graham M Gibson, Rebecca Hay, Richard W Bowman, Miles J Padgett, and Matthew P Edgar. A fast 3d reconstruction system with a low-cost camera accessory. Scientific reports, 5:10909, 2015.
  • [64] Tianyang Zhao, Yifei Xu, Mathew Monfort, Wongun Choi, Chris Baker, Yibiao Zhao, Yizhou Wang, and Ying Nian Wu. Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, 2019.
  • [65] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
  • [66] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.