EGO-TOPO: Environment Affordances from Egocentric Video

CVPR, pp. 160-169, 2020.

Cited by 2 | Views 80 | EI | DOI: https://doi.org/10.1109/CVPR42600.2020.00024
Other links: arxiv.org | dblp.uni-trier.de | academic.microsoft.com
TL;DR

We propose a method to produce a topological affordance graph from egocentric video of human activity, highlighting commonly used zones that afford coherent actions across multiple kitchen environments.

Abstract

First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned...

Introduction
  • Scene understanding is largely about answering the who/where/what questions of recognition: what objects are present? is it an indoor/outdoor scene? where is the person and what are they doing? [56, 53, 73, 43, 74, 34, 70, 18]
Highlights
  • “The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill.” (J. J. Gibson)
  • On two challenging egocentric datasets, EPIC and EGTEA+, we show the value of modeling the environment explicitly for egocentric video understanding tasks, leading to more robust scene affordance models, and improving over state-of-the-art long range action anticipation models
  • We evaluate the three variants of our method from Sec. 3.2, which use maps built from a single video (OURS-S), from multiple videos of the same kitchen (OURS-M), and from a functionally linked, consolidated map across kitchens (OURS-C)
  • We proposed a method to produce a topological affordance graph from egocentric video of human activity, highlighting commonly used zones that afford coherent actions across multiple kitchen environments (a toy construction of such a graph is sketched after this list)
  • Our experiments on scene affordance learning and long range anticipation demonstrate its viability as an enhanced representation of the environment gained from egocentric video
  • Future work can leverage the environment affordances to guide users in unfamiliar spaces with augmented reality or allow robots to explore a new space through the lens of how it is likely used
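
A minimal, illustrative sketch of this idea follows (it is not the authors' implementation): clips from an egocentric video are grouped into zone nodes by visual similarity, and consecutive visits are linked into a topological graph whose nodes accumulate the interactions they afford. The cosine-similarity threshold, the precomputed clip features, the interaction labels, and the use of networkx are all assumptions made for this example.

    # Illustrative sketch only: build a topological "zone" graph from one egocentric
    # video by linking visually similar clips. The real EGO-TOPO system uses a learned
    # localization network; cosine similarity on precomputed clip features stands in
    # for it here (an assumption).
    from collections import defaultdict
    import numpy as np
    import networkx as nx

    SIM_THRESHOLD = 0.8  # hypothetical threshold for deciding "same zone"

    def build_zone_graph(clip_features, clip_interactions):
        """clip_features: temporally ordered clip feature vectors (numpy arrays).
           clip_interactions: one interaction label per clip, e.g. 'cut onion'."""
        graph = nx.Graph()
        zone_feats = []                  # one representative (first-visit) feature per zone
        zone_actions = defaultdict(set)  # interactions observed in each zone
        prev_zone = None
        for feat, action in zip(clip_features, clip_interactions):
            feat = np.asarray(feat, dtype=float)
            feat = feat / (np.linalg.norm(feat) + 1e-8)
            sims = [float(feat @ z) for z in zone_feats]
            if sims and max(sims) > SIM_THRESHOLD:
                zone = int(np.argmax(sims))      # revisit of an existing zone
            else:
                zone = len(zone_feats)           # open a new zone node
                zone_feats.append(feat)
                graph.add_node(zone)
            zone_actions[zone].add(action)
            if prev_zone is not None and prev_zone != zone:
                graph.add_edge(prev_zone, zone)  # the wearer moved between these zones
            prev_zone = zone
        nx.set_node_attributes(graph, dict(zone_actions), "affordances")
        return graph

Graphs built this way from single videos, multiple videos of one kitchen, or linked across kitchens would correspond roughly to the OURS-S / OURS-M / OURS-C variants described above.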
Methods
  • The authors evaluate the proposed topological graphs for scene affordance learning and action anticipation in long videos (a minimal formulation of the affordance-prediction task is sketched after this list).

    Datasets.
  • EGTEA Gaze+ [42] contains videos of 32 subjects following 7 recipes in a single kitchen.
  • EPIC-Kitchens [6] contains videos of daily kitchen activities, and is not limited to a single recipe.
  • It is annotated for interactions spanning 352 objects and 125 actions.
  • Compared to EGTEA+, EPIC is larger, unscripted, and collected across multiple kitchens
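
For concreteness, scene affordance learning can be read as multi-label prediction of the interactions a zone affords (cf. Table 3). The sketch below shows only that generic formulation, not the paper's graph-based model; the feature dimension, label-set size, and plain linear head are placeholders.

    # Generic multi-label affordance-prediction set-up (a sketch, not the paper's model):
    # each zone is represented by a feature vector and labeled with a multi-hot vector
    # of the interactions it affords.
    import torch
    import torch.nn as nn

    NUM_AFFORDANCES = 120  # placeholder: size of the afforded-interaction label set (see Table 3)
    FEAT_DIM = 2048        # placeholder: dimensionality of pooled visual features per zone

    head = nn.Sequential(
        nn.Linear(FEAT_DIM, 512),
        nn.ReLU(),
        nn.Linear(512, NUM_AFFORDANCES),
    )
    criterion = nn.BCEWithLogitsLoss()  # multi-label: an independent sigmoid per interaction

    zone_feats = torch.randn(8, FEAT_DIM)                       # 8 dummy zone nodes
    labels = torch.randint(0, 2, (8, NUM_AFFORDANCES)).float()  # multi-hot affordance labels
    loss = criterion(head(zone_feats), labels)
    loss.backward()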
Conclusion
  • The authors proposed a method to produce a topological affordance graph from egocentric video of human activity, highlighting commonly used zones that afford coherent actions across multiple kitchen environments.
  • The authors' experiments on scene affordance learning and long range anticipation demonstrate its viability as an enhanced representation of the environment gained from egocentric video.
  • Future work can leverage the environment affordances to guide users in unfamiliar spaces with AR or allow robots to explore a new space through the lens of how it is likely used
Tables
  • Table1: Environment affordance prediction. Our method outperforms all other methods. Note that videos in EGTEA+ are from the same kitchen, and do not allow cross-kitchen linking. Values are averaged over 5 runs
  • Table2: Long term anticipation results. Our method outperforms all others on EPIC, and is best for many-shot classes on the simpler EGTEA+. Values are averaged over 5 runs
  • Table3: List of afforded interactions annotated for EPIC and EGTEA+
  • Table4: Affordance prediction results with varying grid sizes. SLAMS refers to the SLAM baseline from Sec. 4.1 with an S × S grid
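
For context on the SLAMS baseline in Table 4, a grid-style accumulation can be sketched as follows: estimated camera positions are binned into an S × S grid and the interactions observed in each cell are collected. The position estimates and label format here are assumed inputs; this is not the authors' exact baseline code.

    # Rough sketch of a grid-style baseline in the spirit of Table 4: camera positions
    # (e.g. from monocular SLAM) are binned into an S x S grid and the interactions
    # observed in each cell are accumulated. Inputs are assumptions for illustration.
    import numpy as np
    from collections import defaultdict

    def grid_affordances(positions, interactions, S=10):
        """positions: (N, 2) array of x, y camera locations; interactions: N labels."""
        positions = np.asarray(positions, dtype=float)
        mins, maxs = positions.min(axis=0), positions.max(axis=0)
        # Normalize positions into [0, 1) and bin them into S x S cells.
        norm = (positions - mins) / np.maximum(maxs - mins, 1e-8)
        cells = np.minimum((norm * S).astype(int), S - 1)
        cell_actions = defaultdict(set)
        for (cx, cy), act in zip(cells, interactions):
            cell_actions[(int(cx), int(cy))].add(act)
        return cell_actions  # maps grid cell -> set of interactions seen there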
Related Work
  • Egocentric video Whereas the camera is a bystander in traditional third-person vision, in first-person or egocentric vision the camera is worn by a person interacting with the surroundings firsthand. This special viewpoint offers an array of interesting challenges, such as detecting gaze [41, 29], monitoring human-object interactions [5, 7, 52], creating daily life activity summaries [45, 40, 71, 44], or inferring the camera wearer’s identity or body pose [28, 33]. The field has grown quickly in recent years, thanks in part to new ego-video benchmarks [6, 42, 55, 63].

    Recent work to recognize or anticipate actions in egocentric video adopts state-of-the-art video models from third-person video, like two-stream networks [42, 47], 3D conv models [6, 54, 49], or recurrent networks [15, 16, 62, 66]. In contrast, our model grounds first-person activity in a persistent topological encoding of the environment. Methods that leverage SLAM together with egocentric video [20, 58, 64] for activity forecasting also allow spatial grounding, though in a metric manner and with the challenges discussed above.
Funding
  • UT Austin is supported in part by ONR PECASE and DARPA L2M
References
  • Y. Abu Farha, A. Richard, and J. Gall. When will you do what?-anticipating temporal occurrences of activities. In CVPR, 2018. 8
  • J.-B. Alayrac, J. Sivic, I. Laptev, and S. Lacoste-Julien. Joint discovery of object states and manipulation actions. ICCV, 2017. 2
  • D. Ashbrook and T. Starner. Learning significant locations and predicting user movement with gps. In ISWC, 2002. 2
  • F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori. Object level visual reasoning in videos. In ECCV, 2018. 2
  • M. Cai, K. M. Kitani, and Y. Sato. Understanding handobject manipulation with grasp types and object attributes. In RSS, 2016. 2
  • D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. 1, 2, 3, 5, 6
  • D. Damen, T. Leelasawassuk, and W. Mayol-Cuevas. Youdo, i-learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. CVIU, 2016. 2
  • V. Delaitre, D. F. Fouhey, I. Laptev, J. Sivic, A. Gupta, and A. A. Efros. Scene semantics from long-term observation of people. In ECCV, 2012. 2
  • D. DeTone, T. Malisiewicz, and A. Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Workshop, 2018. 3, 11
  • K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese. Scene memory transformer for embodied agents in long-horizon tasks. In CVPR, 2019. 2
  • K. Fang, T.-L. Wu, D. Yang, S. Savarese, and J. J. Lim. Demo2vec: Reasoning object affordances from online videos. In CVPR, 2018. 2
  • D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single view geometry. IJCV, 2014. 3
  • A. Furnari, S. Battiato, and G. M. Farinella. Personallocation-based temporal segmentation of egocentric videos for lifelogging applications. JVCIR, 2018. 2
  • A. Furnari, S. Battiato, K. Grauman, and G. M. Farinella. Next-active-object prediction from egocentric videos. JVCI, 2017. 5
  • A. Furnari and G. M. Farinella. What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention. ICCV, 2019. 1, 2, 5, 8
  • J. Gao, Z. Yang, and R. Nevatia. Red: Reinforced encoderdecoder networks for action anticipation. BMVC, 2017. 2, 5
  • R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, 2017. 2, 7, 8
  • G. Gkioxari, R. Girshick, P. Dollar, and K. He. Detecting and recognizing human-object interactions. In CVPR, 2018.
  • H. Grabner, J. Gall, and L. Van Gool. What makes a chair a chair? In CVPR, 2011. 2
  • J. Guan, Y. Yuan, K. M. Kitani, and N. Rhinehart. Generative hybrid representations for activity forecasting with no-regret learning. arXiv preprint arXiv:1904.06250, 2019. 1, 2, 6, 13
  • A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. From 3d scene geometry to human workspace. In CVPR, 2011. 3
  • S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017. 2
  • S. Gupta, D. Fouhey, S. Levine, and J. Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017. 2
  • R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006. 4
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 3
  • J. F. Henriques and A. Vedaldi. Mapnet: An allocentric spatial memory for mapping environments. In CVPR, 2018. 2
  • S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997. 7
  • Y. Hoshen and S. Peleg. An egocentric look at video photographer identity. In CVPR, 2016. 2
  • Y. Huang, M. Cai, Z. Li, and Y. Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In ECCV, 2018. 2
  • N. Hussein, E. Gavves, and A. W. Smeulders. Timeception for complex action recognition. In CVPR, 2019. 2, 6, 7, 8
  • N. Hussein, E. Gavves, and A. W. Smeulders. Videograph: Recognizing minutes-long human activities in videos. ICCV Workshop, 2019. 2, 6, 7, 8
  • D. Jayaraman and K. Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In CVPR, 2016. 4
  • H. Jiang and K. Grauman. Seeing invisible poses: Estimating 3d body pose from egocentric video. In CVPR, 2017. 2
  • J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 1
  • Q. Ke, M. Fritz, and B. Schiele. Time-conditioned action anticipation in one shot. In CVPR, 2019. 8
  • T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017. 6
  • K. Koile, K. Tollmar, D. Demirdjian, H. Shrobe, and T. Darrell. Activity zones for context-aware computing. In UbiComp, 2003. 2
  • H. S. Koppula and A. Saxena. Physically grounded spatiotemporal object affordances. In ECCV, 2014. 2
  • H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014. 6
  • Y. J. Lee and K. Grauman. Predicting important objects for egocentric video summarization. IJCV, 2015. 2
  • Y. Li, A. Fathi, and J. M. Rehg. Learning to predict gaze in egocentric video. In ICCV, 2013. 2
  • Y. Li, M. Liu, and J. M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV, 2018. 1, 2, 3, 6
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 1
  • C. Lu, R. Liao, and J. Jia. Personal object discovery in firstperson videos. TIP, 2015. 2
  • Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013. 2
  • C.-Y. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In CVPR, 2018. 2
  • M. Ma, H. Fan, and K. M. Kitani. Going deeper into firstperson activity recognition. In CVPR, 2016. 1, 2
  • L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008. 8
  • A. Miech, I. Laptev, J. Sivic, H. Wang, L. Torresani, and D. Tran. Leveraging the present to anticipate the future in videos. In CVPR Workshop, 2019. 1, 2
  • H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, 2009. 4
  • R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam system. Transactions on Robotics, 2015. 3, 6, 13
  • T. Nagarajan, C. Feichtenhofer, and K. Grauman. Grounded human-object interaction hotspots from video. ICCV, 2019. 2
  • M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011. 1
  • F. Pirri, L. Mauro, E. Alati, V. Ntouskos, M. Izadpanahkakhk, and E. Omrani. Anticipation and next action forecasting in video: an end-to-end model with memory. arXiv preprint arXiv:1901.03728, 2019. 2, 5
  • H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012. 2
  • A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009. 1
  • N. Rhinehart and K. M. Kitani. Learning action maps of large environments via first-person vision. In CVPR, 2016. 3, 6, 7, 12
  • N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017. 1, 2
  • M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012. 6
  • N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. ICLR, 2018. 2, 3, 4
  • M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. Scenegrok: Inferring action maps in 3d environments. TOG, 2014. 3
  • Y. Shi, B. Fernando, and R. Hartley. Action anticipation with rbf kernelized feature mapping rnn. In ECCV, 2018. 2, 5
  • G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018. 2
  • H. Soo Park, J.-J. Hwang, Y. Niu, and J. Shi. Egocentric future localization. In CVPR, 2016. 1, 2
  • S. Stein and S. J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In UbiComp, 2013. 6
  • S. Sudhakaran, S. Escalera, and O. Lanz. Lsta: Long shortterm attention for egocentric action recognition. In CVPR, 2019. 1, 2
  • X. Wang, R. Girdhar, and A. Gupta. Binge watching: Scaling affordance learning from sitcoms. In CVPR, 2017. 3
  • X. Wang and A. Gupta. Videos as space-time region graphs. In ECCV, 2018. 2
  • C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019. 2, 6
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. In CVPR, 2017. 1
  • R. Yonetani, K. M. Kitani, and Y. Sato. Visual motif discovery via first-person vision. In ECCV, 2016. 2
  • Y. Zhang, P. Tokmakov, M. Hebert, and C. Schmid. A structured model for action detection. In CVPR, 2019. 2
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017. 1
  • B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019. 1
  • L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018. 6
  • Y. Zhou and T. L. Berg. Temporal perception and prediction in ego-centric video. In ICCV, 2015. 5