💎 GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

CVPR 2025

Mariam Hassan★1, Sebastian Stapf★2, Ahmad Rahimi★1, Pedro M B Rezende★2, Yasaman Haghighi♦1,
David BrĂĽggemann♦3, Isinsu Katircioglu♦3, Lin Zhang♦3, Xiaoran Chen♦3, Suman Saha♦3,
Marco Cannici♦4, Elie Aljalbout♦4, Botao Ye♦5, Xi Wang♦5, Aram Davtyan2,
Mathieu Salzmann1,3, Davide Scaramuzza4, Marc Pollefeys5, Paolo Favaro2, Alexandre Alahi1
1École Polytechnique Fédérale de Lausanne (EPFL), 2University of Bern,
3Swiss Data Science Center, 4University of Zurich, 5ETH Zurich
★ Main Contributors    ♦ Data Contributors

Unconditional Generations

We show examples of unconditional generations from the model in diverse scenes with different driving dynamics.

Ego Control

We show examples of ego-motion controllability. All videos are generated by GEM from the same starting frame but with different trajectory control inputs. We observe that the model follows the control signals and generates realistic scenes; a minimal encoding sketch follows the clips below.

Left trajectory input

Right trajectory input
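The exact conditioning interface is described in the paper; as a rough illustration, a 2D waypoint trajectory can be mapped to a small set of tokens that the video model cross-attends to. The sketch below is a hypothetical encoder, not GEM's actual API: the class name `TrajectoryEncoder`, the token dimension, and the waypoint count are all our assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Illustrative encoder: maps 2D ego waypoints to conditioning tokens.

    Hypothetical sketch -- GEM's actual trajectory interface may differ.
    """

    def __init__(self, num_waypoints: int = 10, token_dim: int = 768):
        super().__init__()
        # One token per future waypoint (x, y) in the ego frame.
        self.proj = nn.Sequential(
            nn.Linear(2, token_dim),
            nn.SiLU(),
            nn.Linear(token_dim, token_dim),
        )
        # Learned index embedding preserves waypoint ordering.
        self.index_embed = nn.Embedding(num_waypoints, token_dim)

    def forward(self, waypoints: torch.Tensor) -> torch.Tensor:
        # waypoints: (batch, num_waypoints, 2) in metres, ego frame.
        _, n, _ = waypoints.shape
        idx = torch.arange(n, device=waypoints.device)
        return self.proj(waypoints) + self.index_embed(idx)[None]

# "Left" vs. "right" trajectories from the same start frame differ only
# in these control tokens; the video model cross-attends to them.
encoder = TrajectoryEncoder()
left = torch.tensor([[-0.5 * t, 2.0 * t] for t in range(10)]).float()[None]
tokens = encoder(left)  # (1, 10, 768), fed to the model as conditioning
```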

Feature-Guided Object Motion

GEM can reposition objects in the scene by leveraging DINOv2 features. Below, we demonstrate GEM’s unconditional generation alongside the same generation incorporating motion control.

The green box highlights the source position of the extracted DINO features, while the blue box marks the target position provided as input tokens. We observe that the object moves smoothly from the source position (green) to the target position (blue); a sketch of this feature extraction follows the clips below.

Unconditional Generation

Object Motion Control
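The control signal pairs DINOv2 features from the source region with a target position. Below is a minimal sketch of what that extraction could look like; the box pooling, the normalized target centre, and the function name `object_motion_token` are our assumptions rather than GEM's actual interface.

```python
import torch

# Load DINOv2 (small) from the official hub; its patch size is 14.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

@torch.no_grad()
def object_motion_token(frame, src_box, dst_box, patch=14):
    """Pool DINOv2 patch features inside the source box and pair them
    with the target box centre. Hypothetical sketch of a control
    token; GEM's exact tokenization may differ.

    frame: (3, H, W) tensor, H and W divisible by 14, ImageNet-normalized.
    src_box / dst_box: (x0, y0, x1, y1) in pixels.
    """
    _, h, w = frame.shape
    feats = dinov2.forward_features(frame[None])["x_norm_patchtokens"]
    feats = feats.reshape(1, h // patch, w // patch, -1)  # patch grid
    x0, y0, x1, y1 = (v // patch for v in src_box)
    obj_feat = feats[0, y0:y1, x0:x1].mean(dim=(0, 1))  # object appearance
    cx = (dst_box[0] + dst_box[2]) / 2 / w               # target centre,
    cy = (dst_box[1] + dst_box[3]) / 2 / h               # normalized
    return obj_feat, torch.tensor([cx, cy])
```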

Object Insertion

We demonstrate GEM's capability to insert objects into scenes and precisely control their motion. In the following examples, we insert a new car into the scene and can even control the movement of existing cars.

Unconditional Generation

Insertion Control

Human Pose Control

GEM can use human poses to control pedestrian motion within the scene. In these examples, pedestrians either cross the street or stop according to the provided controls.

Moving poses control

Static poses control

Long Generation

We compare GEM's long generations with those of Vista, the only other world model trained on OpenDV capable of generating long sequences. We observe that our generations exhibit higher ego-motion temporal consistency and more realistic dynamics.

GEM's Long Generation

Vista's Long Generation

Interesting Observations

We show interesting behaviors observed in the generated videos. These behaviors do not necessarily exist in the ground truth videos, but emerge from the model's learned dynamics.

Brake lights go off before moving

Smooth overtaking dynamics in a long generation

Multimodal

GEM generates two modalities simultaneously: RGB and depth. We show examples of multimodal generations.

Multidomain

GEM is fine-tuned on two other egocentric domains, and we observe that it adapts quickly to these new domains.

1. Drone Flights


2. Human Egocentric


Pseudo-labeling

Below, we present visualizations of our pseudo-labeling pipeline, which generates skeleton poses, depth maps, and ego-motion trajectories.
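As a rough illustration of such a pipeline, the sketch below runs an off-the-shelf monocular depth estimator per frame and leaves the pose and ego-motion estimators as explicit stubs. The model choice (`Intel/dpt-large`) and the stub names are our assumptions, not necessarily the estimators used in the paper.

```python
from PIL import Image
from transformers import pipeline  # Hugging Face depth-estimation pipeline

# Off-the-shelf monocular depth estimator; the model choice is ours.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

def estimate_poses(frame):
    """Stub for an off-the-shelf 2D human-pose estimator
    (hypothetical; the paper's choice may differ)."""
    raise NotImplementedError

def estimate_ego_motion(frame_paths):
    """Stub for a monocular visual-odometry / ego-motion model
    (hypothetical; returns a per-frame camera trajectory)."""
    raise NotImplementedError

def pseudo_label(frame_paths):
    """Per-clip pseudo-labeling loop: depth and poses per frame,
    plus one ego trajectory per clip."""
    per_frame = []
    for path in frame_paths:
        frame = Image.open(path)
        depth = depth_estimator(frame)["depth"]  # PIL depth map
        per_frame.append({"depth": depth, "poses": estimate_poses(frame)})
    return per_frame, estimate_ego_motion(frame_paths)
```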