We show examples of unconditional generations from the model in diverse scenes with different driving dynamics.
We show examples of ego-motion controllability. All videos are generated by GEM from the same starting frame but with different trajectory control inputs.
We observe that the model follows the control signals and generates realistic scenes.
GEM can move objects in the scene using DINO features.
In the following examples, we show an unconditional generation by GEM alongside the same generation with motion control.
The green box indicates the source DINO features and the blue box indicates the target position tokens.
We observe that the object moves from the green box to the blue box.
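As a rough illustration of this kind of box-based control interface, the sketch below builds a conditioning pair from a DINO feature grid: the features inside the source box and the normalized center of the target box. All names here (`make_motion_control`, the box format) are hypothetical and not the actual GEM implementation.

```python
import numpy as np

def make_motion_control(feature_map, src_box, tgt_box):
    """Hypothetical sketch of box-based motion control conditioning.

    feature_map: (H, W, C) DINO feature grid for the first frame.
    src_box, tgt_box: (x0, y0, x1, y1) in feature-grid coordinates.
    Returns the source features and the normalized target center
    that a generator could be conditioned on.
    """
    H, W, C = feature_map.shape
    x0, y0, x1, y1 = src_box
    # Features of the object to move (inside the green box).
    tokens = feature_map[y0:y1, x0:x1].reshape(-1, C)
    tx0, ty0, tx1, ty1 = tgt_box
    # Normalized center of the target position (the blue box).
    target = np.array([(tx0 + tx1) / 2 / W, (ty0 + ty1) / 2 / H])
    return tokens, target

# Toy example on a random 16x16 feature grid with 8 channels.
feats = np.random.rand(16, 16, 8)
tokens, target = make_motion_control(feats, (2, 2, 4, 4), (10, 10, 12, 12))
```

Here `tokens` has shape `(4, 8)` (a 2x2 patch of 8-dim features) and `target` is `[0.6875, 0.6875]`, the target-box center in normalized coordinates.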
Unconditional Generation
Object Motion Control
Unconditional Generation
Insertion Control
In the following example, we insert a car on the left and control the motion of another car on the right.
GEM can use human poses to control the motion of pedestrians in the scene.
In this example, the pedestrians cross the street or stop based on the human-pose controls.
We compare our long generation with the only world model trained on OpenDV capable of generating long sequences.
We observe that our generations have higher ego motion temporal consistency and more realistic dynamics.
GEM
Vista
We show interesting behaviors observed in the generated videos.
These behaviors do not necessarily exist in the ground truth videos, but emerge from the model's learned dynamics.
Brake lights go off before the car starts moving
Smooth takeover dynamics in a long generation
GEM generates two modalities simultaneously: RGB and Depth. We show examples of multimodal generations.
GEM is finetuned on two other ego-centric domains, and we observe that it quickly adapts to them.
We show some visualisations of the outputs of our pseudo-labeling pipeline.