Learning a Predictable and Generative Vector Representation for Objects.

ECCV (2016): 484-499

Cited by: 506

Abstract

What is a good vector representation of an object? We believe that it should be generative in 3D, in the sense that it can produce new 3D objects; as well as be predictable from 2D, in the sense that it can be perceived from 2D images. We propose a novel architecture, called the TL-embedding network, to learn an embedding space with these properties.

Introduction
  • What is a good vector representation for objects? On the one hand, there has been a great deal of work on discriminative models such as ConvNets [18,32] mapping 2D pixels to semantic labels.
  • There is an alternate line of work focusing on learning to generate objects using 3D CAD models and deconvolutional networks [5,19].
  • These two criteria, being generative in 3D and predictable from 2D, are often at odds with each other: modeling occluded voxels in 3D is useful for generating objects but very difficult to predict from an image.
Highlights
  • What is a good vector representation for objects? On the one hand, there has been a great deal of work on discriminative models such as ConvNets [18,32] mapping 2D pixels to semantic labels
  • In contrast to the purely discriminative paradigm, these approaches explicitly address the 3D nature of objects and have shown success in generative tasks; however, they offer no guarantees that their representations can be inferred from images, and they have not been shown to be useful for natural-image tasks
  • We propose to unify these two threads of research and introduce a new vector representation of objects (a minimal architectural sketch follows this list)
  • Our experiments demonstrate that: (1) our representation is generative in 3D, permitting reconstruction of novel CAD models; (2) our representation is predictable from 2D, allowing us to predict the full 3D voxels of an object from an image, as well as do fast CAD model retrieval from a natural image; and (3) the learned space has a number of good properties, such as being smooth, carrying class-discriminative information, and allowing vector arithmetic
  • Among a large body of works trying to infer 3D representations from images, our approach is most related to a group of works using renderings of 3D CAD models to predict properties such as object viewpoint [35] or class [34], among others [33,10,27]
  • We report a comparison with Li et al. [21] on their labeled evaluation set of 315 images and 105 models; their method is specific to nearest-neighbor model retrieval and has a number of advantages over our approach
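
To make the two-branch design concrete, the following is a minimal PyTorch sketch of a TL-style network, not the authors' implementation. The 20 × 20 × 20 voxel input comes from the summary; the layer sizes and the 64-dimensional latent are assumptions made only for illustration.

    import torch
    import torch.nn as nn

    LATENT = 64  # assumed embedding dimensionality (not stated in this summary)

    class VoxelAutoencoder(nn.Module):
        """3D convolutional autoencoder over (B, 1, 20, 20, 20) occupancy grids."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 20 -> 10
                nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 10 -> 5
                nn.Flatten(),
                nn.Linear(64 * 5 * 5 * 5, LATENT),
            )
            self.decoder = nn.Sequential(
                nn.Linear(LATENT, 64 * 5 * 5 * 5), nn.ReLU(),
                nn.Unflatten(1, (64, 5, 5, 5)),
                nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 5 -> 10
                nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),              # 10 -> 20
                nn.Sigmoid(),  # per-voxel occupancy probability
            )

        def forward(self, vox):
            z = self.encoder(vox)
            return self.decoder(z), z

    class ImageToEmbedding(nn.Module):
        """2D ConvNet regressing the voxel embedding from a (B, 3, H, W) rendered image."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(64 * 4 * 4, LATENT),
            )

        def forward(self, img):
            return self.net(img)

One plausible reading of the three-stage training mentioned in the results below: (1) train the autoencoder on voxel grids with a per-voxel reconstruction loss; (2) train the image network to regress the frozen voxel embeddings (e.g., with an L2 loss); (3) fine-tune jointly. The summary does not state the exact losses. At test time, the decoder maps a predicted embedding back to voxels, which is what enables voxel prediction from a single image and fast CAD model retrieval by nearest neighbors in the latent space.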
Results
  • The authors take the 3D voxel map of a CAD model as well as its 2D rendered image and jointly optimize the components.
  • The authors can use the autoencoder and the ConvNet to obtain representations for 3D voxels and images, respectively, in the common latent space.
  • The authors' experiments demonstrate that: (1) the representation is generative in 3D, permitting reconstruction of novel CAD models; (2) the representation is predictable from 2D, allowing them to predict the full 3D voxels of an object from an image, as well as do fast CAD model retrieval from a natural image; and (3) that the learned space has a number of good properties, such as being smooth, carrying class-discriminative information, and allowing vector arithmetic.
  • Among a large body of works trying to infer 3D representations from images, the approach is most related to a group of works using renderings of 3D CAD models to predict properties such as object viewpoint [35] or class [34], among others [33,10,27].
  • Autoencoder Network Architecture: The autoencoder takes a 20 × 20 × 20 voxel grid representation of the CAD model as input.
  • The images are generated by rendering the 3D model, and the network is trained in a three-stage procedure.
  • The voxel encoder generates the embedding for the voxel grid, and the image network is trained to regress this embedding.
  • The authors first verify, in a number of ways, that the learned representation models the space of voxels well: it is reconstructive, smooth, and able to distinguish different classes of objects (Sec. 4.2).
  • The authors show that the learned space is smooth by computing reconstructions for linear interpolations between the latent representations of randomly picked test models (see the sketch after this list).
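
As a concrete reading of the smoothness check and the vector-arithmetic property, the sketch below decodes reconstructions along a linear path between two latent codes, and decodes z(a) - z(b) + z(c) in the spirit of word-vector arithmetic [25]. It assumes the hypothetical VoxelAutoencoder from the earlier sketch, already trained; the paper's exact procedure may differ.

    import torch

    @torch.no_grad()
    def interpolate_voxels(autoencoder, vox_a, vox_b, steps=8):
        """Decode voxel grids along a linear path between two latent codes."""
        z_a = autoencoder.encoder(vox_a)
        z_b = autoencoder.encoder(vox_b)
        blends = []
        for t in torch.linspace(0.0, 1.0, steps):
            z = (1 - t) * z_a + t * z_b  # linear interpolation in latent space
            blends.append(autoencoder.decoder(z))
        return torch.stack(blends)

    @torch.no_grad()
    def latent_arithmetic(autoencoder, vox_a, vox_b, vox_c):
        """Decode z(a) - z(b) + z(c), analogous to word-embedding arithmetic."""
        z = (autoencoder.encoder(vox_a)
             - autoencoder.encoder(vox_b)
             + autoencoder.encoder(vox_c))
        return autoencoder.decoder(z)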
Conclusion
  • The direct baselines test whether the autoencoder's low-dimensional representation is necessary, and the without-joint baseline tests whether training the model to be jointly generative and predictable is important.
  • Predicting voxels directly again performs worse than predicting the latent representation and reconstructing from it, validating the idea of using a lower-dimensional representation of objects (a metric sketch follows this list).
  • Poor performance tends to result from images containing multiple objects, causing the network to predict the representation for the “wrong” object out of the ambiguous input.
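
The tables below report Average Precision (AP) for voxel prediction. The summary does not spell out the protocol (for example, pooled over all models versus averaged per model), so the following is a minimal sketch under the pooled assumption, scoring flattened per-voxel probabilities against ground-truth occupancy:

    import numpy as np
    from sklearn.metrics import average_precision_score

    def voxel_ap(pred_probs, gt_occupancy):
        """AP over flattened voxels.
        pred_probs:   (N, 20, 20, 20) floats in [0, 1]
        gt_occupancy: (N, 20, 20, 20) binary occupancy {0, 1}
        """
        return average_precision_score(
            np.asarray(gt_occupancy).ravel().astype(int),
            np.asarray(pred_probs).ravel(),
        )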
Tables
  • Table 1: Reconstruction performance using AP on test data
  • Table 2: Average Precision for voxel prediction on the CAD test set; the proposed TL-network outperforms the baselines on each object
  • Table 3: Average Precision for voxel prediction on the IKEA dataset
  • Table 4: Mean recall@10 of the ground-truth model in retrievals, for our method and the baseline described in Sec. 4.4
Related work
  • Our work aims to produce a representation that is generative in 3D and predictable from 2D and thus touches on two long-standing and important questions in computer vision: how do we represent 3D objects in a vector space and how do we recognize this representation in images?

    Learning an embedding, or vector representation, of visual objects is a well-studied problem in computer vision. In the seminal work of Olshausen and Field [26], the objective was to obtain a representation that was sparse and could reconstruct the pixels. Since then, there has been a great deal of work in this reconstructive vein. For a long time, researchers focused on techniques such as stacked RBMs or autoencoders [12,36] or DBMs [30], and more recently, this has taken the form of generative adversarial models [9]. This line of work, however, has focused on building a 2D generative model of the pixels themselves. In this case, if the representation captures any 3D properties, it is modeled implicitly. In contrast, we focus on explicitly modeling the 3D shape of the world. Thus, our work is most similar to a number of recent exceptions to the 2D end-to-end approach. Dosovitskiy et al. [5] used 3D CAD models to learn a parameterized generative model for objects, and Kulkarni et al. [19] introduced a technique to guide the latent representation of a generative model to explicitly model certain 3D properties. While they use 3D data like our work, they use it to build a generative model for 2D images. Our work is complementary: their work can generate the pixels for a chair and ours can generate the voxels (and thus help an agent or robot to interact with it).
Funding
  • This work was partially supported by Siebel Scholarship to RG, NDSEG Fellowship to DF and Bosch Young Faculty Fellowship to AG
  • This material is based on research partially sponsored by ONR MURI N000141010934, ONR MURI N000141612007, NSF1320083 and a gift from Google
References
  • [1] Aubry, M., Maturana, D., Efros, A., Russell, B., Sivic, J.: Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In: CVPR (2014)
  • [2] Chaudhuri, S., Kalogerakis, E., Guibas, L., Koltun, V.: Probabilistic reasoning for assembly-based 3D modeling. SIGGRAPH (2011)
  • [3] Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. CoRR abs/1604.00449 (2016)
  • [4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR, pp. 248-255 (2009)
  • [5] Dosovitskiy, A., Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: CVPR (2015)
  • [6] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
  • [7] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NIPS (2014)
  • [8] Fouhey, D.F., Gupta, A., Hebert, M.: Data-driven 3D primitives for single image understanding. In: ICCV (2013)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
  • [10] Gupta, S., Arbelaez, P.A., Girshick, R.B., Malik, J.: Inferring 3D object pose in RGB-D images. CoRR (2015)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR abs/1502.01852 (2015)
  • [12] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (2006)
  • [13] Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. IJCV (2007)
  • [14] Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. SIGGRAPH 34(4) (2015)
  • [15] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
  • [16] Kalogerakis, E., Chaudhuri, S., Koller, D., Koltun, V.: A probabilistic model of component-based shape synthesis. SIGGRAPH (2012)
  • [17] Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Category-specific object reconstruction from a single image. In: CVPR (2015)
  • [18] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097-1105 (2012)
  • [19] Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. In: NIPS (2015)
  • [20] Li, Y., Pirk, S., Su, H., Qi, C.R., Guibas, L.J.: FPNN: Field probing neural networks for 3D data. CoRR abs/1605.06240 (2016)
  • [21] Li, Y., Su, H., Qi, C.R., Fish, N., Cohen-Or, D., Guibas, L.J.: Joint embeddings of shapes and images via CNN image purification. ACM TOG (2015)
  • [22] Lim, J.J., Khosla, A., Torralba, A.: FPM: Fine pose parts-based model with 3D CAD models. In: ECCV (2014)
  • [23] Lim, J.J., Pirsiavash, H., Torralba, A.: Parsing IKEA objects: Fine pose estimation. In: ICCV (2013)
  • [24] Maturana, D., Scherer, S.: VoxNet: A 3D convolutional neural network for real-time object recognition. In: IROS (2015)
  • [25] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
  • [26] Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature (1996)
  • [27] Peng, X., Sun, B., Ali, K., Saenko, K.: Exploring invariances in deep convolutional neural networks using synthetic images. CoRR (2014)
  • [28] Princeton ModelNet: http://modelnet.cs.princeton.edu/
  • [29] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434 (2015)
  • [30] Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: AISTATS, vol. 5 (2009)
  • [31] Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. TPAMI 30(5), 824-840 (2008)
  • [32] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
  • [33] Stark, M., Goesele, M., Schiele, B.: Back to the future: Learning shape models from 3D CAD data. In: BMVC (2010)
  • [34] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3D shape recognition. In: ICCV (2015)
  • [35] Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In: ICCV (2015)
  • [36] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR (2010)
  • [37] Wang, X., Fouhey, D.F., Gupta, A.: Designing deep networks for surface normal estimation. In: CVPR (2015)
  • [38] Wu, J., Xue, T., Lim, J.J., Tian, Y., Tenenbaum, J.B., Torralba, A., Freeman, W.T.: Single image 3D interpreter network. In: ECCV (2016)
  • [39] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3D ShapeNets: A deep representation for volumetric shapes. In: CVPR (2015)
  • [40] Xiang, Y., Mottaghi, R., Savarese, S.: Beyond PASCAL: A benchmark for 3D object detection in the wild. In: WACV (2014)