💎 GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan★1, Sebastian Stapf★2, Ahmad Rahimi★1, Pedro M B Rezende★2, Yasaman Haghighi♦1,
David Brüggemann♦3, Isinsu Katircioglu♦3, Lin Zhang♦3, Xiaoran Chen♦3, Suman Saha♦3,
Marco Cannici♦4, Elie Aljalbout♦4, Botao Ye♦5, Xi Wang♦5, Aram Davtyan2,
Mathieu Salzmann1,3, Davide Scaramuzza4, Marc Pollefeys5, Paolo Favaro2, Alexandre Alahi1
1École Polytechnique Fédérale de Lausanne (EPFL), 2University of Bern,
3Swiss Data Science Center, 4University of Zurich, 5ETH Zurich
★ Main Contributors    ♦ Data Contributors

Unconditional Generations

We show examples of unconditional generations from the model in diverse scenes with different driving dynamics.

Ego Control

We show examples of ego-motion controllability. All videos are generated by GEM from the same starting frame but with different trajectory control inputs.
We observe that the model follows the control signals and generates realistic scenes.

Object Manipulation

GEM can move objects in the scene using DINO features.
In the following examples, we show an unconditional generation by GEM and the same generation with motion control.
The green box indicates the source DINO features, and the blue boxes indicate the target position tokens used.
We observe that the object moves from the green box to the blue box.
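
As a rough illustration of how a box on the image could be tied to patch-level features, the sketch below maps a pixel-space box to the indices of the patch tokens it covers. This is not GEM's implementation: the function name `box_to_patch_indices`, the patch size, the image size, and the row-major token layout are all assumptions made for the example.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; a hypothetical box format.
Box = Tuple[float, float, float, float]


def box_to_patch_indices(box: Box, image_size: Tuple[int, int], patch_size: int) -> List[int]:
    """Return row-major indices of the patches whose centers lie inside `box`.

    Assumes the image is split into a regular grid of square patches,
    as in a ViT-style feature extractor.
    """
    width, height = image_size
    cols, rows = width // patch_size, height // patch_size
    x_min, y_min, x_max, y_max = box
    indices = []
    for row in range(rows):
        for col in range(cols):
            # Center of the current patch in pixel coordinates.
            cx = (col + 0.5) * patch_size
            cy = (row + 0.5) * patch_size
            if x_min <= cx < x_max and y_min <= cy < y_max:
                indices.append(row * cols + col)
    return indices


# Hypothetical 64x32 image with 16-pixel patches (a 4x2 token grid).
source_tokens = box_to_patch_indices((0, 0, 32, 16), (64, 32), 16)    # "green" box
target_tokens = box_to_patch_indices((32, 16, 64, 32), (64, 32), 16)  # "blue" box
print(source_tokens, target_tokens)
```

Under these assumptions, features would be read at `source_tokens` and associated with the positions in `target_tokens`; how GEM actually encodes this control signal is described in the paper, not here.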

Unconditional Generation

Object Motion Control

Object Insertion

Unconditional Generation

Insertion Control

In the following example, we insert a car on the left and control the motion of another car on the right.

Human Pose Manipulation

GEM can use human poses to control the motion of pedestrians in the scene.
In these examples, the pedestrians cross the street or stop according to the human pose controls.

Long Generation

We compare our long generations with those of Vista, the only other world model trained on OpenDV that can generate long sequences.
We observe that our generations have higher ego-motion temporal consistency and more realistic dynamics.

GEM

Vista

Interesting Observations

We show interesting behaviors observed in the generated videos.
These behaviors do not necessarily exist in the ground truth videos, but emerge from the model's learned dynamics.

Brake lights go off before moving

Smooth takeover dynamics on a long generation

MultiModal

GEM generates two modalities simultaneously: RGB and Depth. We show examples of multimodal generations.

MultiDomain

GEM is fine-tuned on two other egocentric domains, and we observe that it quickly adapts to them.

1-Drone Flights


2-Human EgoCentric


Pseudo-Labelling

We show visualizations of the outputs of our pseudo-labeling pipeline.