Indoor segmentation and support inference from RGBD images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus

ECCV (5), pp. 746–760, 2012.

Cited by: 2477 | DOI: https://doi.org/10.1007/978-3-642-33715-4_54

Abstract:

We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
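
The integer programming formulation mentioned in the abstract assigns each region a supporting region and a support type. As a minimal, hypothetical sketch (not the paper's actual objective or constraints), the core selection step can be written as a 0/1 program in which each region picks exactly one (supporting region, support type) pair at minimum classifier-derived cost; the `costs` array and the `infer_support` name below are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def infer_support(costs):
    """Choose one (supporting region, support type) per region by solving
    a 0/1 integer program.

    costs: array of shape (n_regions, n_candidates, n_types), where
    costs[i, j, t] is the classifier-derived cost of region i being
    supported by candidate j with support type t.
    """
    n, m, k = costs.shape
    c = costs.reshape(-1)                    # one binary variable per (i, j, t)
    # Constraint: each region selects exactly one (candidate, type) pair.
    A = np.zeros((n, n * m * k))
    for i in range(n):
        A[i, i * m * k:(i + 1) * m * k] = 1.0
    res = milp(c=c,
               constraints=LinearConstraint(A, lb=1, ub=1),
               integrality=np.ones_like(c),  # all variables are 0/1
               bounds=Bounds(0, 1))
    x = np.round(res.x).reshape(n, m, k)
    # For each region, return the chosen (supporting region, support type).
    return [tuple(int(v) for v in np.argwhere(x[i])[0]) for i in range(n)]
```

Note that this decoupled version reduces to a per-region argmin; what makes the paper's program non-trivial are the consistency constraints it adds between regions and structure classes, which are omitted in this sketch.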

Introduction
  • Traditional approaches to scene understanding aim to provide labels for each object in the image.
  • A person walking into a room, for example, might want to find his coffee cup and favorite book, pick them up, find a place to sit, walk over, and sit down
  • These tasks require parsing the scene into different objects and surfaces – the coffee cup, for example, must be distinguished from surrounding objects and from its supporting surface.
  • Some tasks also require understanding the interactions of scene elements: if the coffee cup is supported by the book, the cup must be lifted first
Highlights
  • Traditional approaches to scene understanding aim to provide labels for each object in the image
  • Many robotics and scene understanding applications require a physical parse of the scene into objects, surfaces, and their relations
  • These tasks require parsing the scene into different objects and surfaces – the coffee cup, for example, must be distinguished from surrounding objects and from its supporting surface
  • We have introduced a new dataset useful for various tasks including recognition, segmentation and inference of physical support relationships
  • Our dataset is unique in the diversity and complexity of depicted indoor scenes, and we provide an approach to parse such complex environments through appearance cues, room-aligned 3D cues, surface fitting, and scene priors
  • We show that initial estimates of support and major surfaces lead to better segmentation
Methods
  • 6.1 Evaluating Segmentation

    To evaluate the segmentation algorithm, the authors use the overlap criterion from [8]; a minimal sketch of this metric appears after this list.
  • The best score (54.1) was obtained in a configuration in which, at each stage of the segmentation, the authors extracted and classified support and structure class features from the intermediate segmentations and used the support and structure classifier outputs as features for boundary classification.
  • The addition of these features improves segmentation performance, with the support features providing a slightly larger gain.
  • To avoid penalizing the support inference for errors in the bottom-up segmentation, the mapping is performed as follows: each support label over ground-truth regions, $[R_i^{GT}, R_j^{GT}, T]$, is replaced with the set of labels $[R_{a_1}^{S}, R_{b_1}^{S}, T], \ldots, [R_{a_w}^{S}, R_{b_w}^{S}, T]$ for which the overlap between the supported regions $(R_i^{GT}, R_{a_k}^{S})$ and between the supporting regions $(R_j^{GT}, R_{b_k}^{S})$ both exceed a threshold of 0.25 (see the sketch after this list)
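
A minimal sketch (not the authors' released code) of the evaluation steps above, assuming regions are given as binary pixel masks and that "overlap" is intersection-over-union; the function names and the min_overlap=0.25 parameter mirror the description but are otherwise hypothetical:

```python
import numpy as np

def region_overlap(mask_a, mask_b):
    """Overlap (intersection over union) between two binary region masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def segmentation_score(gt_regions, seg_regions, weighted=True):
    """Average best-match overlap over ground-truth regions (Table 1).

    If weighted, each ground-truth region contributes in proportion to its
    pixel area; otherwise all regions count equally.
    """
    scores = [max(region_overlap(gt, seg) for seg in seg_regions)
              for gt in gt_regions]
    weights = [gt.sum() if weighted else 1.0 for gt in gt_regions]
    return float(np.average(scores, weights=weights))

def map_support_labels(gt_labels, gt_regions, seg_regions, min_overlap=0.25):
    """Transfer ground-truth support labels onto segmented regions.

    Each label (i, j, t), meaning region i is supported by region j with
    support type t, is replaced by every pair (a, b) of segmented regions
    that overlaps the corresponding ground-truth regions by more than
    min_overlap.
    """
    mapped = []
    for (i, j, t) in gt_labels:
        for a, seg_a in enumerate(seg_regions):
            if region_overlap(gt_regions[i], seg_a) <= min_overlap:
                continue
            for b, seg_b in enumerate(seg_regions):
                if region_overlap(gt_regions[j], seg_b) > min_overlap:
                    mapped.append((a, b, t))
    return mapped
```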
Conclusion
  • The authors have introduced a new dataset useful for various tasks including recognition, segmentation and inference of physical support relationships.
  • The authors' experiments show that the supporting region and the type of support can be reliably inferred, especially when segmentations are accurate.
  • The authors show that initial estimates of support and major surfaces lead to better segmentation.
  • Future work could include inferring the full extent of objects and surfaces and categorizing objects
Tables
  • Table 1: Accuracy of hierarchical segmentation, measured as the average overlap between each ground-truth region and its best-matching segmented region, either weighted by pixel area or unweighted
  • Table 2: Results of the various approaches to support inference. Accuracy is measured as the number of regions whose support is correctly inferred divided by the number of labeled regions; Type Aware accuracy penalizes an incorrect support type, while Type Agnostic accuracy does not (a minimal sketch of this metric follows)
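
A minimal sketch of the Table 2 metric, assuming support annotations are stored as hypothetical dictionaries mapping a region id to its (supporting region, support type) pair; none of these names come from the paper:

```python
def support_accuracy(predictions, gt_labels, type_aware=True):
    """Fraction of labeled regions whose support is correctly inferred.

    predictions, gt_labels: dicts mapping a region id to a
    (supporting_region, support_type) pair. With type_aware=True an
    incorrect support type counts as an error; with type_aware=False
    only the supporting region must match.
    """
    correct = 0
    for region, (gt_parent, gt_type) in gt_labels.items():
        pred_parent, pred_type = predictions.get(region, (None, None))
        if pred_parent == gt_parent and (not type_aware or pred_type == gt_type):
            correct += 1
    return correct / len(gt_labels)
```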
Related work
  • Our overall approach of incorporating geometric priors to improve scene interpretation is most related to a set of image-based single-view methods (e.g. [1,2,3,4,5,6,7]). Our use of “structural classes”, such as “furniture” and “prop”, to improve segmentation and support inference relates to the use of “geometric classes” [1] to segment objects [8] or produce volumetric scene parses [3,5,6,7]. Our goal of inferring support relations is most closely related to Gupta et al. [6], who apply heuristics inspired by physical reasoning to infer volumetric shapes, occlusion, and support in outdoor scenes. Our 3D cues provide a much stronger basis for inference of support, and our dataset enables us to train and evaluate support predictors that can cope with scene clutter and invisible supporting regions. Russell and Torralba [9] show how a dataset of user-annotated scenes can be used to infer 3D structure and support; our approach, in contrast, is fully automatic.
Funding
  • This work was supported in part by NSF Awards 0904209, 0916014, and IIS-1116923
References
  1. Hoiem, D., Efros, A.A., Hebert, M.: Geometric context from a single image. In: ICCV (2005)
  2. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV (2009)
  3. Hedau, V., Hoiem, D., Forsyth, D.: Thinking inside the box: Using appearance models and context based on room geometry. In: ECCV 2010, Part VI. LNCS, vol. 6316, pp. 224–237. Springer, Heidelberg (2010)
  4. Lee, D.C., Hebert, M., Kanade, T.: Geometric reasoning for single image structure recovery. In: CVPR (2009)
  5. Lee, D.C., Gupta, A., Hebert, M., Kanade, T.: Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In: NIPS (2010)
  6. Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: Image understanding using qualitative geometry and mechanics. In: ECCV 2010, Part IV. LNCS, vol. 6314, pp. 482–496. Springer, Heidelberg (2010)
  7. Gupta, A., Satkin, S., Efros, A.A., Hebert, M.: From 3D scene geometry to human workspace. In: CVPR (2011)
  8. Hoiem, D., Efros, A.A., Hebert, M.: Recovering occlusion boundaries from an image. Int. J. Comput. Vision 91, 328–346 (2011)
  9. Russell, B.C., Torralba, A.: Building a database of 3D scenes from user annotations. In: CVPR (2009)
  10. Zhang, C., Wang, L., Yang, R.: Semantic segmentation of urban scenes using dense depth maps. In: ECCV 2010, Part IV. LNCS, vol. 6314, pp. 708–721. Springer, Heidelberg (2010)
  11. Silberman, N., Fergus, R.: Indoor scene segmentation using a structured light sensor. In: ICCV Workshop on 3D Representation and Recognition (2011)
  12. Karayev, S., Janoch, A., Jia, Y., Barron, J., Fritz, M., Saenko, K., Darrell, T.: A category-level 3-D database: Putting the Kinect to work. In: ICCV Workshop on Consumer Depth Cameras for Computer Vision (2011)
  13. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: ICRA (2011)
  14. Koppula, H., Anand, A., Joachims, T., Saxena, A.: Semantic labeling of 3D point clouds for indoor scenes. In: NIPS (2011)
  15. Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: SIGGRAPH (2004)
  16. Coughlan, J., Yuille, A.: Manhattan world: Orientation and outlier detection by Bayesian inference. Neural Computation 15 (2003)
  17. Kosecka, J., Zhang, W.: Video compass. In: ECCV 2002, Part IV. LNCS, vol. 2353, pp. 476–490. Springer, Heidelberg (2002)
  18. Arbelaez, P.: Boundary extraction in natural images using ultrametric contour maps. In: Proc. POCV (2006)
  19. Tighe, J., Lazebnik, S.: SuperParsing: Scalable nonparametric image parsing with superpixels. In: ECCV 2010, Part V. LNCS, vol. 6315, pp. 352–365. Springer, Heidelberg (2010)