Learning a Predictable and Generative Vector Representation for Objects.
ECCV (2016): 484–499
What is a good vector representation of an object? We believe that it should be generative in 3D, in the sense that it can produce new 3D objects; as well as be predictable from 2D, in the sense that it can be perceived from 2D images. We propose a novel architecture, called the TL-embedding network, to learn an embedding space with these properties.
- What is a good vector representation for objects? On the one hand, there has been a great deal of work on discriminative models such as ConvNets [18,32] mapping 2D pixels to semantic labels.
- There is an alternate line of work focusing on learning to generate objects using 3D CAD models and deconvolutional networks [5,19].
- These criteria are often at odds with each other: modeling occluded voxels in 3D is useful for generating objects but very difficult to predict from an image.
- In contrast to the purely discriminative paradigm, these approaches explicitly address the 3D nature of objects and have shown success in generative tasks; however, they offer no guarantees that their representations can be inferred from images, and they have not been shown to be useful for natural-image tasks
- We unify these two threads of research and propose a new vector representation of objects
- Our experiments demonstrate that: (1) our representation is generative in 3D, permitting reconstruction of novel CAD models; (2) our representation is predictable from 2D, allowing us to predict the full 3D voxels of an object from an image, as well as do fast CAD model retrieval from a natural image; and (3) that the learned space has a number of good properties, such as being smooth, carrying class-discriminative information, and allowing vector arithmetic
- Among a large body of works trying to infer 3D representations from images, our approach is most related to a group of works that use renderings of 3D CAD models to predict properties such as object viewpoint or class, among others [33,10,27]
- We report a comparison on their 315-image, 105-model labeled evaluation set. The compared method is specific to nearest-neighbor model retrieval and has a number of advantages over our approach
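Because images and CAD models are embedded into a common latent space, CAD model retrieval from a natural image reduces to a nearest-neighbor search over precomputed model embeddings. Below is a minimal sketch of that search, with random vectors standing in for learned embeddings; the 64-d dimensionality, array names, and noise level are illustrative assumptions, not the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned embeddings: one per CAD model in the database,
# plus one embedding predicted from a query image (here: the embedding of
# model 42 with a little noise, mimicking imperfect 2D prediction).
model_embeddings = rng.normal(size=(100, 64))
query_embedding = model_embeddings[42] + 0.01 * rng.normal(size=64)

def retrieve(query, database, k=10):
    """Return indices of the k nearest database embeddings (Euclidean)."""
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:k]

top10 = retrieve(query_embedding, model_embeddings)
print(top10[0])  # -> 42: the true model ranks first for this noisy query
```

Because the search is a single matrix distance computation, retrieval stays fast even for large model databases, which is what makes the shared embedding attractive for this task.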
- The authors take the 3D voxel map of a CAD model as well as its 2D rendered image and jointly optimize the components.
- The authors can use the autoencoder and the ConvNet to obtain representations for 3D voxels and images, respectively, in the common latent space.
- Autoencoder Network Architecture: The autoencoder takes a 20 × 20 × 20 voxel grid representation of the CAD model as input.
- The images are generated by rendering the 3D model, and the network is trained in a three-stage procedure.
- The encoder generates the embedding for the voxel grid, and the image network is trained to regress that embedding.
- The authors first verify that the learned representation models the space of voxels well in a number of ways: that it is reconstructive, smooth, and can be used to distinguish different classes of objects (Sec. 4.2).
- The authors show that the learned space is smooth, by computing reconstructions for linear interpolation between latent representations of randomly picked test models.
- The direct baselines test whether the autoencoder's low-dimensional representation is necessary, and the without-joint baseline tests whether training the model to be jointly generative and predictable is important.
- Predicting voxels again performs worse compared to predicting the latent space and reconstructing, validating the idea of using a lower-dimensional representation of objects.
- Poor performance tends to result from images containing multiple objects, causing the network to predict the representation for the “wrong” object out of the ambiguous input.
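The three-stage procedure above can be sketched with toy linear layers standing in for the 3D convolutional autoencoder and the image ConvNet. Only the loss structure (voxel reconstruction, then embedding regression against frozen voxel embeddings, then joint fine-tuning) follows the description; the squared-error losses, layer shapes, and random weights are simplifying assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a batch of 20x20x20 binary voxel grids (flattened)
# and their flattened 32x32 renderings (toy sizes for the images).
voxels = (rng.random((4, 20 * 20 * 20)) > 0.7).astype(np.float32)
images = rng.normal(size=(4, 32 * 32))

# Toy linear stand-ins for encoder, decoder, and image network weights.
W_enc = rng.normal(size=(8000, 64)) * 0.01   # voxels -> 64-d embedding
W_dec = rng.normal(size=(64, 8000)) * 0.01   # embedding -> voxel scores
W_img = rng.normal(size=(1024, 64)) * 0.01   # image -> 64-d embedding

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stage 1: train the autoencoder on voxels (reconstruction loss).
z_voxel = voxels @ W_enc
recon = sigmoid(z_voxel @ W_dec)
recon_loss = np.mean((recon - voxels) ** 2)

# Stage 2: train the image network to regress the (frozen) voxel embedding.
z_image = images @ W_img
embed_loss = np.mean((z_image - z_voxel) ** 2)

# Stage 3: fine-tune all components jointly on the combined objective.
total_loss = recon_loss + embed_loss
```

At test time the decoder path alone turns an image-predicted embedding into a full voxel grid, which is how a single network serves both generation and prediction.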
- Table1: Reconstruction performance using AP on test data
- Table2: Average Precision for Voxel Prediction on the CAD test set. The proposed TL-network outperforms the baselines on each object
- Table3: Average Precision for Voxel Prediction on the IKEA dataset
- Table4: Mean recall @10 of ground truth model in retrievals for our method and baseline described in Sec. 4.4
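The tables above report Average Precision over predicted voxel occupancy. As a hedged sketch of ranking-based AP (the paper's exact evaluation protocol may differ in details; the voxel scores and labels below are invented for illustration):

```python
import numpy as np

def average_precision(scores, labels):
    """Ranking AP: sort voxels by predicted occupancy score, then average
    the precision attained at the rank of each truly occupied voxel."""
    order = np.argsort(-scores)            # descending by score
    labels = labels[order]
    hits = np.cumsum(labels)               # true positives seen so far
    precision_at_k = hits / np.arange(1, len(labels) + 1)
    return float(np.sum(precision_at_k * labels) / labels.sum())

# Toy example: 6 voxels, 3 occupied; the scores rank them perfectly.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
print(average_precision(scores, labels))  # -> 1.0
```

A mis-ranked occupied voxel drops AP below 1.0, so the metric rewards confident predictions on occupied voxels rather than just overall accuracy.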
- Our work aims to produce a representation that is generative in 3D and predictable from 2D and thus touches on two long-standing and important questions in computer vision: how do we represent 3D objects in a vector space and how do we recognize this representation in images?
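The smoothness experiment (Sec. 4.2) amounts to decoding linear interpolants between the latent codes of two test models. A minimal sketch of the interpolation step, with random vectors standing in for learned 64-d codes and the decoder omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random stand-ins for the latent codes of two test models (assumed 64-d).
z_a = rng.normal(size=64)
z_b = rng.normal(size=64)

# Five evenly spaced points on the segment between the two codes; feeding
# each to the decoder would yield the gradually morphing voxel grids.
interpolants = [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, 5)]
```

If the decoded shapes along this path all look like plausible objects, the latent space is smooth in the sense the experiment tests.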
Learning an embedding, or vector representation, of visual objects is a well-studied problem in computer vision. In the seminal work of Olshausen and Field, the objective was to obtain a representation that was sparse and could reconstruct the pixels. Since then, there has been a great deal of work in this reconstructive vein. For a long time, researchers focused on techniques such as stacked RBMs or autoencoders [12,36] or DBMs, and more recently this has taken the form of generative adversarial models. This line of work, however, has focused on building a 2D generative model of the pixels themselves; if the representation captures any 3D properties, it models them only implicitly. In contrast, we focus on explicitly modeling the 3D shape of the world. Thus, our work is most similar to a number of recent exceptions to the 2D end-to-end approach. Dosovitskiy et al. used 3D CAD models to learn a parameterized generative model for objects, and Kulkarni et al. introduced a technique to guide the latent representation of a generative model to explicitly model certain 3D properties. While they use 3D data like our work, they use it to build a generative model for 2D images. Our work is complementary: their work can generate the pixels for a chair, and ours can generate the voxels (and thus help an agent or robot interact with it).
- This work was partially supported by Siebel Scholarship to RG, NDSEG Fellowship to DF and Bosch Young Faculty Fellowship to AG
- This material is based on research partially sponsored by ONR MURI N000141010934, ONR MURI N000141612007, NSF1320083 and a gift from Google
- 1. Aubry, M., Maturana, D., Efros, A., Russell, B., Sivic, J.: Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In: CVPR (2014)
- 2. Chaudhuri, S., Kalogerakis, E., Guibas, L., Koltun, V.: Probabilistic reasoning for assembly-based 3D modeling. SIGGRAPH (2011)
- 3. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. CoRR abs/1604.00449 (2016)
- 4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
- 5. Dosovitskiy, A., Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: CVPR (2015)
- 6. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
- 7. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NIPS (2014)
- 8. Fouhey, D.F., Gupta, A., Hebert, M.: Data-driven 3D primitives for single image understanding. In: ICCV (2013)
- 9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
- 10. Gupta, S., Arbelaez, P.A., Girshick, R.B., Malik, J.: Inferring 3D object pose in RGB-D images. CoRR (2015)
- 11. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR abs/1502.01852 (2015)
- 12. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science (2006)
- 13. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. In: IJCV (2007)
- 14. Huang, Q., Wang, H., Koltun, V.: Single-view reconstruction via joint analysis of image and shape collections. SIGGRAPH 34(4) (2015)
- 15. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
- 16. Kalogerakis, E., Chaudhuri, S., Koller, D., Koltun, V.: A probabilistic model of component-based shape synthesis. SIGGRAPH (2012)
- 17. Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Category-specific object reconstruction from a single image. In: CVPR (2015)
- 18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. pp. 1097–1105 (2012)
- 19. Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep convolutional inverse graphics network. In: NIPS (2015)
- 20. Li, Y., Pirk, S., Su, H., Qi, C.R., Guibas, L.J.: FPNN: Field probing neural networks for 3D data. CoRR abs/1605.06240 (2016)
- 21. Li, Y., Su, H., Qi, C.R., Fish, N., Cohen-Or, D., Guibas, L.J.: Joint embeddings of shapes and images via CNN image purification. ACM TOG (2015)
- 22. Lim, J.J., Khosla, A., Torralba, A.: FPM: Fine pose parts-based model with 3D CAD models. In: ECCV (2014)
- 23. Lim, J.J., Pirsiavash, H., Torralba, A.: Parsing IKEA objects: Fine pose estimation. In: ICCV (2013)
- 24. Maturana, D., Scherer, S.: VoxNet: A 3D convolutional neural network for real-time object recognition. In: IROS (2015)
- 25. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
- 26. Olshausen, B., Field, D.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature (1996)
- 27. Peng, X., Sun, B., Ali, K., Saenko, K.: Exploring invariances in deep convolutional neural networks using synthetic images. CoRR (2014)
- 28. Princeton ModelNet: http://modelnet.cs.princeton.edu/
- 29. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434 (2015)
- 30. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: AISTATS. vol. 5 (2009)
- 31. Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. TPAMI 30(5), 824–840 (2008)
- 32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
- 33. Stark, M., Goesele, M., Schiele, B.: Back to the future: Learning shape models from 3D CAD data. In: BMVC (2010)
- 34. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.G.: Multi-view convolutional neural networks for 3D shape recognition. In: ICCV (2015)
- 35. Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In: ICCV (2015)
- 36. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. (2010)
- 37. Wang, X., Fouhey, D.F., Gupta, A.: Designing deep networks for surface normal estimation. In: CVPR (2015)
- 38. Wu, J., Xue, T., Lim, J.J., Tian, Y., Tenenbaum, J.B., Torralba, A., Freeman, W.T.: Single image 3D interpreter network. In: ECCV (2016)
- 39. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3D shapenets: A deep representation for volumetric shapes. In: CVPR (2015)
- 40. Xiang, Y., Mottaghi, R., Savarese, S.: Beyond pascal: A benchmark for 3d object detection in the wild. In: WACV (2014)