ECCV 2024 Workshop Foundation Models for 3D Humans¶
Siyu Tang¶
Siyu Tang: Learning foundation models for 3D humans: a data request
- We already have 2D human foundation models, e.g. Sapiens: Foundation for Human Vision Models.
- Leading question: foundation models for 3D humans in 3D environments.
    - "3D humans": shape, motion, behavior
    - "3D environments": human-centric, reconstruction & interaction
    - tasks: reconstruction, prediction, understanding
- Problem: lack of data.
The goal is to improve perception models for 3D humans in 3D environments.
Data Sources¶
- From large video models (not Tang's topic).
- Synthetic data
    - BEDLAM, EgoGen, etc.
    - rich & accurate annotations
    - controllability
        - modify the data distribution for specific applications
    - BUT human behavior synthesis is a hard problem
- Monocular videos (online videos)
    - PROX, ProxyCap, etc.
    - requires better video reconstruction
    - diverse motion and appearance
    - rich semantics
    - BUT limited 3D annotations
- Embodied egocentric captures
    - Nymeria, EgoBody, etc.
    - extended temporal duration (more and more embodied devices), i.e. long videos
    - rich & close hand-object interaction data
    - multi-modal data (humans + audio + locations + scenes + ...)
    - BUT limited observations for mocap (presumably meaning that the capture of the device wearer is always truncated?)
- All three sources are complementary.
Several of Tang's Works Answering the Question¶
Labeling Monocular Videos¶
I.e., use motion capture from monocular videos to generate pseudo ground truth (pseudo GT).
- Challenges
    - Noisy 2D detections (jitter)
    - Occlusions
- Solutions (a minimal sketch of the decoupling idea follows this list)
    - learn motion priors, e.g. RoHM
        - noisy to smooth
        - decouple trajectory and pose estimation: pose and trajectory should jointly help each other
        - BUT not real time (efficiency is important for scaling up data)
        - BUT scene constraints not considered
        - BUT ignores human appearance
    - IntrinsicAvatar (addresses appearance)
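For intuition, here is a minimal, hypothetical PyTorch sketch of the decoupled trajectory/pose refinement, where the two streams condition on each other during iterative denoising. The module design, tensor shapes, and refinement loop are my own illustrative assumptions, not RoHM's actual architecture.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Tiny per-frame denoiser: noisy signal + context -> cleaner signal."""
    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + ctx_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t, ctx, t):
        t_emb = torch.full((x_t.shape[0], 1), float(t))  # crude timestep embedding
        return self.net(torch.cat([x_t, ctx, t_emb], dim=-1))

# Two coupled streams: each denoiser conditions on the other's current
# estimate, so trajectory and pose refine each other jointly.
traj_net = Denoiser(dim=3, ctx_dim=63)   # global root trajectory per frame
pose_net = Denoiser(dim=63, ctx_dim=3)   # 21 joints x 3 rotation params (assumed)

def refine(traj_noisy, pose_noisy, steps=10):
    traj, pose = traj_noisy, pose_noisy
    for t in reversed(range(steps)):
        traj = traj_net(traj, pose.detach(), t)  # pose guides the trajectory
        pose = pose_net(pose, traj.detach(), t)  # trajectory guides the pose
    return traj, pose

traj, pose = refine(torch.randn(16, 3), torch.randn(16, 63))  # 16 noisy frames
```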
Synthesized Data¶
Generate human motion data.
Temporal Spectrum of Human Behaviors:
- Movement primitives: carry some semantic meaning while still providing abundant data
    - To train a model, how fine-grained does the modeled motion need to be? There may be no need to model at the level of full actions.
- Text-aligned latent motion space (combine with LLMs)
    - generative, controllable, diverse, causal
    - easy to control; composes long-term complex activities
- Solutions:
    - e.g. DART (see the sketch after this list)
        - online text-to-motion
        - latent motion primitive space
        - autoregressive generation
        - interaction tasks:
            - motion in-betweening
            - human-scene interaction (the model can infer correct stair ascending/descending behavior even without stair data in training)
    - e.g. EgoGen (a combination of egocentric and synthetic data)
        - a complete generation pipeline
        - trained in static environments but extends to dynamic situations
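A rough sketch of what autoregressive generation in a latent motion-primitive space could look like, for intuition only: the module names, tensor sizes, and rollout interface below are invented for illustration and are not DART's actual API.

```python
import torch
import torch.nn as nn

PRIM_LEN, POSE_DIM, LATENT, TEXT = 8, 69, 128, 512  # assumed sizes

class PrimitiveDecoder(nn.Module):
    """Decodes one latent motion primitive into a short clip of poses."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT, PRIM_LEN * POSE_DIM)

    def forward(self, z):
        return self.net(z).view(-1, PRIM_LEN, POSE_DIM)

class LatentPrior(nn.Module):
    """Predicts the next latent primitive from the previous one + text (causal)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT + TEXT, 256), nn.SiLU(), nn.Linear(256, 2 * LATENT)
        )

    def forward(self, z_prev, text_emb):
        mu, logvar = self.net(torch.cat([z_prev, text_emb], -1)).chunk(2, -1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample next z

decoder, prior = PrimitiveDecoder(), LatentPrior()

def rollout(text_embs, z0):
    """Chain primitives into a long motion; the text condition may change at
    every step, which is what makes generation online and controllable."""
    z, clips = z0, []
    for text_emb in text_embs:  # one (possibly new) instruction per primitive
        z = prior(z, text_emb)
        clips.append(decoder(z))
    return torch.cat(clips, dim=1)  # (batch, total_frames, POSE_DIM)

motion = rollout([torch.randn(1, TEXT) for _ in range(4)], torch.zeros(1, LATENT))
print(motion.shape)  # torch.Size([1, 32, 69])
```

Composing long, complex activities then reduces to feeding a new instruction at each step, which matches the "easy to control" point above.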
Xavier Puig¶
Xavier Puig: Human Models for Embodied AI
One focus of Embodied AI: robots that can collaborate with humans.
Tasks:

- Object & Vision & Language Navigation
- Instruction Following, Object Manipulation
- Modality (Audio, Vision, Tactile Sensing)

Challenges:

- Dynamic environments
- Reasoning about the human
- Adaptation to new humans
Directions:

1. Efficient & Realistic Human Simulation
    - E.g., Habitat 3.0 (in the following three aspects)
        - To be efficient: precomputing.
        - Hierarchical: high-level controller -> motion primitives -> joint angles.
        - BUT not rich enough.
    - Controllable Human-Object Interaction Synthesis
        - Challenges:
            - scene & text conditioning (lack of data)
            - realistic interactions (contact, floating, etc.)
        - Solutions: a two-stage approach
            - a planning module
            - an interaction synthesis module
            - test-time guidance with contact terms (sketched below)
        - OMOMO Dataset
2. Human-Robot Collaboration Tasks
    - Social Navigation (the robot finds & follows somebody)
3. Human-In-The-Loop Evaluation
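The test-time guidance with contact terms can be made concrete with a small sketch: a cost that penalizes hand-object distance, whose gradient steers each sampling step. The denoiser interface, joint layout, and the simplified update rule are illustrative assumptions; the talk only states that contact terms guide sampling.

```python
import torch

def contact_cost(motion, obj_points):
    """Mean hand-to-object distance; encourages contact, discourages floating.
    motion: (T, J, 3) joint positions; obj_points: (N, 3) object surface samples."""
    hands = motion[:, -2:, :]  # assumes the last two joints are the hands
    d = torch.cdist(hands.reshape(-1, 3), obj_points)
    return d.min(dim=-1).values.mean()

def guided_sample(denoiser, obj_points, shape, steps=50, scale=0.1):
    """Iterative sampling with a guidance gradient; the plain update below
    stands in for a proper diffusion sampling step."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, t)                 # model's clean-motion estimate
        grad = torch.autograd.grad(contact_cost(x0_hat, obj_points), x)[0]
        x = x0_hat.detach() - scale * grad      # denoise, then push toward contact
    return x

# Usage with a dummy denoiser (stand-in for the interaction synthesis module):
motion = guided_sample(lambda x, t: 0.95 * x, torch.randn(64, 3), shape=(16, 22, 3))
```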
Human Models for Embodied AI:

- Need to focus on human-centric scenarios in Embodied AI.
- Generative human models that can be used for robot training and evaluation.
- Predictive models of human behavior for better human-robot collaboration.