# LCollision: Fast Generation of Collision-Free Human Poses using Learned Non-Penetration Constraints

Qingyang Tan

Abstract:

We present a learning-based method (LCollision) that synthesizes collision-free 3D human poses. At the crux of our approach is a novel deep architecture that simultaneously decodes new human poses from the latent space and classifies the collision status. These two components of our architecture are used as the objective function and su…

Introduction
• There has been considerable work on developing learning algorithms on 3D objects, represented as point clouds (Qi et al 2017), meshes (Hanocka et al 2019), volumetric grids (Wang, Liu, and Tong 2020), and physical objects (Li et al 2019).
• These methods learn a manifold of plausible human poses from a dataset, represented as the latent space of a deep autoencoder.
• After learning the feasible domain, solving a constrained optimization for a collision-free human pose with 2161 vertices takes an average of 2.095 iterations and 0.25s.
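The constrained-optimization step above can be pictured with a minimal sketch: gradient descent on a differentiable collision penalty over the latent code until the pose is predicted collision-free. Here a toy quadratic penalty over a 2-D latent space stands in for the learned penetration-depth energy; `collision_energy`, `collision_grad`, and `project_to_collision_free` are illustrative names, not the paper's API.

```python
import numpy as np

def collision_energy(z):
    # Toy stand-in for the learned penetration-depth energy: positive
    # inside an "infeasible" unit ball in latent space, zero outside.
    # The real method uses the network's differentiable PD predictor.
    r2 = float(z @ z)
    return max(0.0, 1.0 - r2) ** 2

def collision_grad(z):
    # Analytic gradient of the toy energy above.
    r2 = float(z @ z)
    if r2 >= 1.0:
        return np.zeros_like(z)
    return -4.0 * (1.0 - r2) * z

def project_to_collision_free(z0, step=0.1, max_iters=100, tol=1e-8):
    """Descend the collision penalty from latent code z0 until the
    pose is predicted collision-free (energy below tol)."""
    z = np.asarray(z0, dtype=float).copy()
    for it in range(max_iters):
        if collision_energy(z) < tol:
            return z, it
        z = z - step * collision_grad(z)
    return z, max_iters

z_free, iters = project_to_collision_free(np.array([0.2, 0.3]))
```

The handful of iterations this toy problem needs mirrors the reported average of about two solver iterations per pose.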
Highlights
• There has been considerable work on developing learning algorithms on 3D objects, represented as point clouds (Qi et al 2017), meshes (Hanocka et al 2019), volumetric grids (Wang, Liu, and Tong 2020), and physical objects (Li et al 2019). As these algorithms are used for different applications, a major challenge is accounting for user requirements and physics-based constraints
• Each point on the human body is softly assigned to a set of local domains, and the collision penalty loss is decomposed to these local domains
• Hybrid Ranking, Potential Energy, and Entropy Loss: exact hard constraints correspond to a binary loss; however, the loss must be differentiable so that constrained optimizations can be guided by gradient information
• We propose using a penetration depth formulation (Zhang et al 2007) as the collision metric to provide gradient directions, and use a ranking loss to enforce the relative differences between samples
• We review related works on human pose estimation and synthesis, collision detection and response, and deep network training with hard constraints
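The ranking idea in the highlights can be illustrated with a RankNet-style pairwise loss (Burges et al 2005): if one sample has a larger ground-truth penetration depth than another, the predictor should score it higher. The exact loss form below is an illustrative assumption, not the paper's formula.

```python
import math

def pairwise_rank_loss(pred_i, pred_j, true_i, true_j):
    """RankNet-style pairwise loss: penalize the predictor when its
    ordering of two samples disagrees with the ground-truth ordering
    of their penetration depths (illustrative form)."""
    s = 1.0 if true_i > true_j else -1.0
    # Logistic loss on the signed score difference: near zero when the
    # predicted ordering is correct with a large margin, large otherwise.
    return math.log(1.0 + math.exp(-s * (pred_i - pred_j)))
```

Such a pairwise loss supplies a useful gradient even when the absolute penetration-depth regression is inaccurate, which is the role the hybrid loss assigns to the ranking term.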
Results
• The authors review related works on human pose estimation and synthesis, collision detection and response, and deep network training with hard constraints.
• The authors' approach is based on recent learning methods (Tan et al 2018a; Tretschk et al 2020) that use 3D meshes to generate detailed human poses.
• The authors can use different collision handling methods to avoid penetrations in a 3D mesh of a human pose.
• Training Deep Networks with Hard Constraints: An additional layer of challenge is to incorporate collision handling into a deep learning framework.
• The authors give an overview of the process of computing the embedding space for human pose generation and highlight the collision-free constraints that LCollision tries to satisfy.
• The learned domain decomposition enhances the reusability and explainability of the neural network and is also used to model the local collisions between body sub-parts, as explained in Section 3.3.
• This constraint is ignored by previous neural-network-based human pose generation methods.
• Given a mesh G, the authors use the FCL library (Pan, Chitta, and Manocha 2012) to compute the squared penetration depth PD^2(p, q) of each colliding triangle pair.
• The authors train a single classifier MLP_classifier(S_1, …, S_Z0) to summarize the information and predict whether there are any collisions throughout the human body, i.e., MLP_classifier is an indicator of whether S_sum = 0.
• To profile the collision response solver quantitatively, the authors sample a set of 3000 random human poses by randomizing Zall for both the SCAPE and Swing datasets.
• On the SCAPE dataset, the method achieves a success rate of 85.6%, and the authors observe a relative decrease of 80.9% for these models compared to the original penetration depth energy.
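The per-domain aggregation described above can be sketched as follows: each body point carries a soft assignment over local domains, and per-point squared penetration depths are aggregated into per-domain sums S_1, …, S_Z0, which a binary classifier then consumes. The Gaussian-kernel assignment and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_assign(points, centers, temperature=1.0):
    """Softly assign each point to local domains: one weight per domain
    per point, rows summing to 1 (Gaussian kernel is an illustrative
    choice for the learned assignment)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    # Subtract the row-wise minimum for numerical stability.
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / temperature)
    return w / w.sum(axis=1, keepdims=True)

def per_domain_energy(point_pd, weights):
    """Aggregate per-point squared penetration depths into the
    per-domain sums S_1..S_Z0 fed to the collision classifier."""
    return weights.T @ point_pd  # shape: (num_domains,)

points = rng.normal(size=(100, 3))
centers = rng.normal(size=(4, 3))
w = soft_assign(points, centers)
pd = np.zeros(100); pd[:10] = 0.5  # a few colliding points
S = per_domain_energy(pd, w)
```

Because the soft weights for each point sum to one, the decomposition preserves the total penetration energy: S.sum() equals pd.sum(), so S_sum = 0 exactly when the mesh is collision-free.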
Conclusion
• The authors present a method for learning the collision-free human pose sub-manifold.
• The authors use a mesh embedding autoencoder to learn a full human pose manifold and augment it with an additional component to classify the collision and other hard constraints.
• The authors learn to predict the penetration depths aggregated to each sub-domain and use a binary classifier to predict whether a given mesh has any collisions.
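The two-headed architecture in the conclusion can be sketched as a latent code feeding both a mesh decoder and a collision branch that predicts per-domain energies and a binary flag. Dimensions, weight shapes, and the ReLU activation are illustrative assumptions, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; the real network dimensions are not given here.
LATENT, VERTS, DOMAINS = 8, 32, 4
W_dec = rng.normal(scale=0.1, size=(VERTS * 3, LATENT))
W_col = rng.normal(scale=0.1, size=(DOMAINS, LATENT))

def decode(z):
    """Decoder head: latent code -> mesh vertex positions."""
    return (W_dec @ z).reshape(VERTS, 3)

def collision_head(z):
    """Collision branch: predict non-negative per-domain penetration
    energies (ReLU here for illustration) and a binary collision flag,
    mirroring the role of MLP_classifier."""
    s = np.maximum(0.0, W_col @ z)
    return s, bool(s.sum() > 0.0)

mesh = decode(np.ones(LATENT))
energies, has_collision = collision_head(np.ones(LATENT))
```

Sharing one latent code between both heads is what lets the collision classifier act as a differentiable constraint on pose synthesis.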
Tables
• Table1: We compare our method (Ours) with 4 baselines: L_entropy + L_PD, L_entropy + L_rank, L_entropy, and ND (no collision decomposition). Each method is trained on the smaller dataset with M = 5 × 10^4 meshes. For each trained model, we compare the accuracy in terms of predicting penetration depth energies (MSE), ranking penetration depth energies (RANK), and classifying collision-free meshes (CLASSIFY). Compared with L_entropy + L_PD, L_entropy + L_rank, and L_entropy, we see the power of our hybrid loss to improve the overall accuracy of collision predictions. The improvement from ND to our method demonstrates that the penetration decomposition is meaningful in our framework
• Table2: We study the robustness of our method in terms of dataset sizes. Increasing the dataset size M can significantly boost the collision detection accuracy (CLASSIFY). This result implies that learning to predict collisions is challenging, and a larger training dataset can help improve the overall results
• Table3: We show the collision detection running time for our method compared with one exact collision detection algorithm (Pan, Chitta, and Manocha 2012). All datasets have 1.5 × 10^4 samples. Swing and Jumping have more points than SCAPE (9971 and 10002 vs 2261), and the complexity of (Pan, Chitta, and Manocha 2012) depends on the number of points, so exact collision checking spends more time on them. However, all three datasets share the same latent space size as SCAPE, so the running times for our method are similar
Related work
• We review related works on human pose estimation and synthesis, collision detection and response, and deep network training with hard constraints.

Human Pose Estimation & Synthesis: There is considerable work on human pose estimation and synthesis. Earlier methods (Leibe, Seemann, and Schiele 2005) represent a pedestrian as a bounding box. An improved algorithm (Agarwal and Triggs 2005) predicts the 55-D joint angles of a skeletal human pose. More accurate predictions were obtained using random forests (Rogez et al 2008) and convolutional neural networks (Toshev and Szegedy 2014). Our approach is based on recent learning methods (Tan et al 2018a; Tretschk et al 2020) that use 3D meshes to generate detailed human poses. Mesh-based representations are inherently difficult to learn due to their intrinsic high dimensionality, and the resulting algorithms can produce sub-optimal results with artifacts such as self-penetrations, noisy mesh surfaces, and flipped meshes. In view of these problems, (Villegas et al 2018) computes only skeletal poses using learning and then uses skinning to recover the mesh-based representation. However, this approach requires additional skeleton-mesh correspondence information, which is typically unavailable in many datasets, including SCAPE (Anguelov et al 2005).
Funding
• We propose using a penetration depth formulation (Zhang et al 2007) as the collision metric to provide gradient directions and then use a ranking loss to enforce the relative differences between samples. We have implemented these algorithms and evaluated the performance on the SCAPE dataset (Anguelov et al 2005), the MIT-Swing dataset (Vlasic et al 2008), and the MIT Jumping dataset (Vlasic et al 2008). Combining these techniques, we achieve an accuracy of 94.1%, a false positive rate of 6.1%, and a false negative rate of 5.7% when predicting collisions for 2.5 × 10^6 randomized testing poses from these datasets
• To classify collision-free meshes, we use the rate of success (CLASSIFY) over the 0.3M test meshes. From this ablation study, we compare ND and our method to find that penetration decomposition can improve the accuracy of collision predictions
• Being a learning-based method, our collision predictor cannot achieve a 100% success rate, in contrast to analytic methods (Barbic and James 2010)
Study subjects and analysis
major datasets: 3
We show that solving our constrained optimization formulation can resolve significantly more collision artifacts than prior learning algorithms. Furthermore, in a large test set of $2.5\times 10^6$ randomized poses from three major datasets, our architecture achieves a collision-prediction accuracy of $94.1\%$ with $80\times$ speedup over exact collision detection algorithms. To the best of our knowledge, LCollision is the first approach that can obtain high accuracy in terms of handling non-penetration and collision constraints in a learning framework

datasets: 3
This stage uses the loss L = w_PD · L_PD + w_rank · L_rank + w_entropy · L_entropy, configured with w_PD = 5, w_rank = 2, and w_entropy = 2, and trained using a learning rate of 0.001 and a batch size of 32 over 30 epochs. We evaluate our method on three datasets: the SCAPE dataset (Anguelov et al 2005) with N = 71 meshes, the MIT-Swing dataset (Vlasic et al 2008) with N = 150 meshes, and the MIT Jumping dataset (Vlasic et al 2008) with N = 150 meshes. For each dataset, we use all the meshes to train the embedding space during the first stage
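The second-stage loss can be written out directly; the weights below are the ones reported in the text, while the function name and calling convention are illustrative.

```python
def hybrid_loss(l_pd, l_rank, l_entropy,
                w_pd=5.0, w_rank=2.0, w_entropy=2.0):
    """Weighted sum L = w_PD*L_PD + w_rank*L_rank + w_entropy*L_entropy,
    with the weights reported in the text (5, 2, 2)."""
    return w_pd * l_pd + w_rank * l_rank + w_entropy * l_entropy
```

Note that the penetration-depth term receives the largest weight, reflecting its role as the primary collision signal, while the ranking and entropy terms act as regularizers.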

samples: 10^4
Our method takes 1.02s, 0.91s, and 1.13s, a speedup of 81×, 342×, and 282× respectively, on the test sets of 5 × 10^4 samples (1.5 × 10^4 samples) for the SCAPE, Swing, and Jumping datasets. To achieve the best performance for (Pan, Chitta, and Manocha 2012), we run their method using 15 threads in parallel and stop when one collision occurs or the process is reported collision-free

Reference
• Agarwal, A.; and Triggs, B. 2005. Recovering 3D human pose from monocular images. IEEE transactions on pattern analysis and machine intelligence 28(1): 44–58.
• Agrawal, A.; Amos, B.; Barratt, S.; Boyd, S.; Diamond, S.; and Kolter, J. Z. 2019. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 9558–9570.
• Anguelov, D.; Srinivasan, P.; Koller, D.; Thrun, S.; Rodgers, J.; and Davis, J. 2005. SCAPE: shape completion and animation of people. In ACM SIGGRAPH, 408–416.
• Bagautdinov, T.; Wu, C.; Saragih, J.; Fua, P.; and Sheikh, Y. 2018. Modeling facial geometry using compositional vaes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3877–3886.
• Barbic, J.; and James, D. L. 2010. Subspace Self-Collision Culling. ACM Trans. on Graphics (SIGGRAPH 2010) 29(4): 81:1–81:9.
• Boggs, P. T.; and Tolle, J. W. 1995. Sequential quadratic programming. Acta numerica 4: 1–51.
• Bouritsas, G.; Bokhnyak, S.; Ploumpis, S.; Bronstein, M.; and Zafeiriou, S. 2019. Neural 3d morphable models: Spiral convolutional networks for 3d shape representation learning and generation. In Proceedings of the IEEE International Conference on Computer Vision, 7213–7222.
• Bridson, R.; Fedkiw, R.; and Anderson, J. 2002. Robust treatment of collisions, contact and friction for cloth animation. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, 594–603.
• Burgard, W.; Brock, O.; and Stachniss, C. 2008. A Fast and Practical Algorithm for Generalized Penetration Depth Computation, 265–272.
• Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, 89–96.
• Gao, L.; Lai, Y.-K.; Yang, J.; Ling-Xiao, Z.; Xia, S.; and Kobbelt, L. 2019. Sparse data driven mesh deformation. IEEE transactions on visualization and computer graphics.
• Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; and Cohen-Or, D. 2019. MeshCNN: A network with an edge. ACM Transactions on Graphics 38(4): 90. ISSN 15577368. doi:10.1145/3306346.3322959.
• Hoffer, E.; and Ailon, N. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 84–92. Springer.
• Kervadec, H.; Dolz, J.; Tang, M.; Granger, E.; Boykov, Y.; and Ayed, I. B. 2019. Constrained-CNN losses for weakly supervised segmentation. Medical image analysis 54: 88–99.
• Kim, Y.; Lin, M.; and Manocha, D. 2018. Collision and proximity queries. Handbook of Discrete and Computational Geometry.
• Kim, Y. J.; Otaduy, M. A.; Lin, M. C.; and Manocha, D. 2002. Fast penetration depth computation for physically-based animation. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, 23–31.
• Leibe, B.; Seemann, E.; and Schiele, B. 2005. Pedestrian detection in crowded scenes. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, 878–885. IEEE.
• Li, Y.; Wu, J.; Tedrake, R.; Tenenbaum, J. B.; and Torralba, A. 2019. Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids. In ICLR.
• Marquez Neila, P.; Salzmann, M.; and Fua, P. 2017. Imposing Hard Constraints on Deep Networks: Promises and Limitations. URL http://infoscience.epfl.ch/record/262884.
• Nandwani, Y.; Pathak, A.; Singla, P.; et al. 2019. A Primal Dual Formulation For Deep Learning With Constraints. In Advances in Neural Information Processing Systems, 12157–12168.
• Pan, J.; Chitta, S.; and Manocha, D. 2012. FCL: A general purpose library for collision and proximity queries. In 2012 IEEE International Conference on Robotics and Automation, 3859–3866. IEEE.
• Pan, J.; Zhang, X.; and Manocha, D. 2013. Efficient penetration depth approximation using active learning. ACM Transactions on Graphics (TOG) 32(6): 1–12.
• Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
• Pham, T.; De Magistris, G.; and Tachibana, R. 2018. OptLayer - Practical Constrained Optimization for Deep Reinforcement Learning in the Real World. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 6236–6243.
• Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 652–660.
• Qiao, Y.-L.; Liang, J.; Koltun, V.; and Lin, M. C. 2020. Scalable Differentiable Physics for Learning and Control. arXiv preprint arXiv:2007.02168.
• Ranjan, A.; Bolkart, T.; Sanyal, S.; and Black, M. J. 2018. Generating 3D faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), 704–720.
• Ravi, S. N.; Dinh, T.; Lokhande, V. S.; and Singh, V. 2019. Explicitly imposing constraints in deep networks via conditional gradients gives improved generalization and faster convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4772–4779.
• Rogez, G.; Rihan, J.; Ramalingam, S.; Orrite, C.; and Torr, P. H. 2008. Randomized trees for human pose detection. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. IEEE.
• Shi, X.; Zhou, K.; Tong, Y.; Desbrun, M.; Bao, H.; and Guo, B. 2007. Mesh puppetry: cascading optimization of mesh deformation with inverse kinematics. In ACM SIGGRAPH 2007 papers, 81–es.
• Smith, B.; Goes, F. D.; and Kim, T. 2018. Stable neo-Hookean flesh simulation. ACM Transactions on Graphics (TOG) 37(2): 1–15.
• Tan, Q.; Gao, L.; Lai, Y.-K.; and Xia, S. 2018a. Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5841–5850.
• Tan, Q.; Gao, L.; Lai, Y.-K.; Yang, J.; and Xia, S. 2018b. Mesh-based autoencoders for localized deformation component analysis. In Thirty-Second AAAI Conference on Artificial Intelligence.
• Tang, M.; Curtis, S.; Yoon, S.-E.; and Manocha, D. 2009. ICCD: Interactive continuous collision detection between deformable models using connectivity-based culling. IEEE Transactions on Visualization and Computer Graphics 15(4): 544–557.
• Teng, Y.; Otaduy, M. A.; and Kim, T. 2014. Simulating Articulated Subspace Self-Contact. ACM Trans. Graph. 33(4). ISSN 0730-0301. doi:10.1145/2601097.2601181. URL https://doi.org/10.1145/2601097.2601181.
• Toshev, A.; and Szegedy, C. 2014. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1653–1660.
• Tretschk, E.; Tewari, A.; Zollhofer, M.; Golyanik, V.; and Theobalt, C. 2020. DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects. European Conference on Computer Vision (ECCV).
• Vanderbei, R. J. 1999. LOQO user’s manual—version 3.10. Optimization methods and software 11(1-4): 485–514.
• Villegas, R.; Yang, J.; Ceylan, D.; and Lee, H. 2018. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8639–8648.
• Vlasic, D.; Baran, I.; Matusik, W.; and Popovic, J. 2008. Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 papers, 1–9.
• Wang, P.-S.; Liu, Y.; and Tong, X. 2020. Deep Octree-based CNNs with Output-Guided Skip Connections for 3D Shape and Scene Completion.
• Zhang, L.; Kim, Y. J.; Varadhan, G.; and Manocha, D. 2007. Generalized penetration depth computation. Computer-Aided Design 39(8): 625–638.