Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
CVPR 2024
Abstract
Despite significant advancements in text-to-motion synthesis, generating
language-guided human motion within 3D environments poses substantial
challenges. These challenges stem primarily from (i) the absence of powerful
generative models capable of jointly modeling natural language, 3D scenes, and
human motion, and (ii) the generative models' intensive data requirements
contrasted with the scarcity of comprehensive, high-quality,
language-scene-motion datasets. To tackle these issues, we introduce a novel
two-stage framework that employs scene affordance as an intermediate
representation, effectively linking 3D scene grounding and conditional motion
generation. Our framework comprises an Affordance Diffusion Model (ADM) for
predicting explicit affordance maps and an Affordance-to-Motion Diffusion Model
(AMDM) for generating plausible human motions. By leveraging scene affordance
maps, our method overcomes the difficulty in generating human motion under
multimodal condition signals, especially when training with limited data
lacking extensive language-scene-motion pairs. Our extensive experiments
demonstrate that our approach consistently outperforms all baselines on
established benchmarks, including HumanML3D and HUMANISE. Additionally, we
validate our model's exceptional generalization capabilities on a specially
curated evaluation set featuring previously unseen descriptions and scenes.
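
To make the two-stage design concrete, below is a minimal, hypothetical Python sketch of the pipeline the abstract describes: a first diffusion model (standing in for the ADM) that maps language and scene features to an affordance map, and a second (standing in for the AMDM) that maps language and the predicted affordance map to motion. The module names, tensor dimensions, toy MLP denoisers, and the simplified sampling loop are all illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of the two-stage affordance-then-motion pipeline.
# Shapes, denoiser design, and the sampling schedule are assumptions.
import torch
import torch.nn as nn


class SimpleDenoiser(nn.Module):
    """Toy conditional denoiser: predicts the noise added to x, given a condition."""

    def __init__(self, x_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden),  # +1 for the timestep
            nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x_t, cond, t):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep encoding
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def ddpm_sample(denoiser, cond, x_dim, steps=50):
    """Highly simplified ancestral sampling loop, for illustration only."""
    x = torch.randn(cond.shape[0], x_dim)
    for step in reversed(range(steps)):
        t = torch.full((cond.shape[0],), step)
        eps = denoiser(x, cond, t)
        x = x - eps / steps                      # crude denoising update
        if step > 0:
            x = x + 0.01 * torch.randn_like(x)   # small stochastic perturbation
    return x


# Stage 1 (ADM-like): language + 3D scene features -> affordance map.
# Stage 2 (AMDM-like): language + affordance map -> human motion features.
text_dim, scene_dim, afford_dim, motion_dim = 512, 256, 128, 263  # assumed sizes

adm = SimpleDenoiser(x_dim=afford_dim, cond_dim=text_dim + scene_dim)
amdm = SimpleDenoiser(x_dim=motion_dim, cond_dim=text_dim + afford_dim)

text_feat = torch.randn(1, text_dim)    # placeholder language embedding
scene_feat = torch.randn(1, scene_dim)  # placeholder 3D scene encoding

affordance_map = ddpm_sample(adm, torch.cat([text_feat, scene_feat], dim=-1), afford_dim)
motion = ddpm_sample(amdm, torch.cat([text_feat, affordance_map], dim=-1), motion_dim)
print(motion.shape)  # e.g. torch.Size([1, 263]), one motion feature vector
```

The point of the sketch is the interface, not the models: the affordance map produced in stage one is the only scene-derived signal the second stage consumes, which is how the framework decouples 3D scene grounding from conditional motion generation.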