Under Review

MAD: Motion Appearance Decoupling
for Efficient Driving World Models

Ahmad Rahimi*¹, Valentin Gerard*¹, Eloi Zablocki², Matthieu Cord²,³, Alexandre Alahi¹
¹EPFL  ·  ²Valeo.ai  ·  ³Sorbonne Université
Contact: first.last@epfl.ch
PS: The background videos are generated by our MAD-LTX model.
MAD about Efficiency?

We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting.
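
For readers who prefer a schematic, the sketch below shows the two-stage data flow; the function names, argument names, and interface are illustrative assumptions and do not correspond to the released code.

```python
# Illustrative sketch of the two-stage MAD pipeline (names are hypothetical).

def generate_driving_video(first_frame, controls,
                           motion_forecaster, appearance_synthesizer):
    """first_frame: initial RGB frame; controls: ego trajectory, boxes, and/or text."""
    # Stage 1: the adapted video diffusion backbone predicts structured motion,
    # i.e. a video of skeletonised agents and scene elements ("what moves where"),
    # so learning focuses on physical and social plausibility.
    skeleton_video = motion_forecaster(first_frame, controls)

    # Stage 2: the same backbone, reused as an appearance synthesizer, "dresses"
    # the predicted motion with realistic texture and lighting, conditioned on
    # the first RGB frame and the skeleton video.
    return appearance_synthesizer(first_frame, skeleton_video)
```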

Key Findings of MAD

Our experiments show this decoupled approach is exceptionally efficient: adapting SVD [2], we match prior SOTA models with less than 6% of their compute. Scaling to LTX [1], our MAD-LTX model outperforms all open-source competitors.

Proof-of-Concept: MAD-SVD
  • Reproduces VISTA [4] performance starting from the same Stable Video Diffusion (SVD) [2] backbone.
  • Using 12× less data (139 hrs vs. 1,700).
  • Using 16× less compute (1,500 GPU-hrs vs. 25,000).
Scaling to SOTA: MAD-LTX 2B & 13B
  • Extremely low training cost: 130 GPU-hrs (2B) and 700 GPU-hrs (13B), using only 130 hours of driving video.
  • SOTA performance among open-source models for video generation quality.
  • Outperforms standard LTX [1] training on both video generation and open-loop planning.
  • Faster inference than previous models, with support for text, ego trajectory, and object control.

Our MAD-LTX Generations

High-Fidelity Appearance Transfer

Our Appearance Synthesizer effectively "dresses" the sparse predicted motion signals with realistic textures and lighting. You can verify the alignment between predicted motion (skeleton) and synthesized appearance (RGB) by moving the vertical slider in each video.

New Agent Generation

A key capability of our Motion Forecaster is the synthesis of agents entering the scene that were not present in the initial frame. The model predicts plausible trajectories for new vehicles and pedestrians, which are then seamlessly rendered by the Appearance Synthesizer, enriching the scene's dynamism.

Implicit Occlusion Modeling

While our pose predictor is trained specifically on cars and pedestrians, it exhibits an implicit understanding of unmodeled occluders, such as trucks. As shown below, the model avoids predicting skeletons where a truck is present (Video 1) or correctly handles the occlusion when a truck enters the scene (Video 2), ensuring that the Appearance Synthesizer does not render conflicting geometry in those regions.

Robust Pedestrian Synthesis

Pedestrians often present a challenge for driving world models due to their non-rigid motion. By conditioning on explicit skeleton sequences, MAD-LTX generates pedestrians with high temporal consistency and realistic gait.

Comparison with other Driving World Models

The generations of MAD-LTX are compared with other baselines, all given the same starting frame and caption (for models that accept captions). We show comparisons at two scales: 13B and 2B parameters. Note that MAD-LTX is trained with significantly fewer computational resources than the baselines. The first three examples show cases where our models' generations are clearly better. We have also improved pedestrian motion compared to prior work, though there is still room for improvement relative to Cosmos Predict 2 [5].

Comparison with MAD-LTX-13B

MAD-LTX 13B
700 GPUh
Vista [4]
25,000 GPUh
GEM [3]
50,000 GPUh
Cosmos Predict1-14B [5]
Unknown
Cosmos Predict2-14B [5]
Unknown

Comparison with MAD-LTX-2B

MAD-LTX 2B
130 GPUh
Vista [4]
25,000 GPUh
GEM [3]
50,000 GPUh
Cosmos Predict1-7B [5]
Unknown
Cosmos Predict2-2B [5]
Unknown

MAD-SVD Results

Comparisons of MAD-SVD against VISTA [4] and GEM [3], all derived from the same Stable Video Diffusion [2] backbone. We achieve the same video generation quality using only a fraction of their training requirements, while the use of poses generally improves the visual and motion quality of the generated pedestrians and cyclists.

MAD-SVD
1,500 GPUh
VISTA
25,000 GPUh
GEM
50,000 GPUh

MAD-LTX Controls

Our model accepts three types of control signals: ego movement control, which steers how the ego vehicle moves through the scene; object movement control, which lets us move objects of interest in the environment; and text captions.

Ego Movement Control

Our motion forecaster accepts an ego-motion representation as an additional input, conditioning its generation on the desired ego movement. We extract this signal from the ground-truth video (which you can inspect by moving the vertical line), then condition MAD-LTX 13B on it to obtain the 'conditional' results. The 'unconditional' column shows videos generated by the same 13B model without access to the ego movement. The conditional generations follow the ego movement closely, while the unconditional generations drift towards different ego motions.

Ground Truth
Conditional
Unconditional
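
For concreteness, one plausible way to turn a per-frame ego trajectory into conditioning tokens for the motion forecaster is sketched below; the (dx, dy, dyaw) parameterisation, the MLP encoder, and the token dimension are assumptions for illustration, not the exact mechanism used in MAD-LTX.

```python
# Hypothetical ego-motion encoder: per-frame displacements become conditioning tokens.
import torch
import torch.nn as nn

class EgoMotionEncoder(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Each future frame is summarised by (dx, dy, dyaw) in the ego frame.
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, ego_traj):          # ego_traj: (B, T, 3)
        return self.mlp(ego_traj)         # (B, T, dim) tokens fed to the backbone

encoder = EgoMotionEncoder()
tokens = encoder(torch.zeros(1, 25, 3))   # e.g. 25 future frames
print(tokens.shape)                       # torch.Size([1, 25, 1024])
```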

Object Movement Control

We use bounding boxes to control the movement of other objects in the scene. Below you can see how our bounding boxes make cars turn or move within the scene, while the unconditional model shows a different behaviour for the same car. The conditional videos show the control bounding boxes (green) overlaid on the generation. Our model follows the bounding-box controls closely, making cars turn left or keeping them at the desired position on the road.

Conditional (Control)
Unconditional
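
As an illustration, the bounding-box control can be viewed as a rasterised control video aligned with the generated frames. The sketch below shows one simple way to build such a signal; the pixel-space box format and single-channel mask layout are assumptions, not necessarily how MAD-LTX encodes boxes internally.

```python
# Hypothetical rasteriser: per-frame 2D boxes -> a binary control video.
import torch

def rasterize_boxes(boxes, T, H, W):
    """boxes: list over T frames, each a list of (x1, y1, x2, y2) in pixels."""
    ctrl = torch.zeros(T, 1, H, W)
    for t, frame_boxes in enumerate(boxes):
        for x1, y1, x2, y2 in frame_boxes:
            ctrl[t, 0, y1:y2, x1:x2] = 1.0   # filled mask for each controlled agent
    return ctrl                              # (T, 1, H, W) extra conditioning channel

ctrl = rasterize_boxes([[(10, 20, 60, 80)]] * 25, T=25, H=256, W=448)
print(ctrl.shape)                            # torch.Size([25, 1, 256, 448])
```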

Text Control

MAD-LTX retains the text-prompting functionality of the base LTX model [1] and adapts it to driving scenarios. Below are examples of how our model follows these prompts in its generations.

Generated Video
Text Control

Ablation Studies

We conduct three ablation studies to support our main design choices. First, we show the effect of noise injection in appearance-synthesizer training. Then, we compare against two alternative intermediate representations: HDMaps and panoptic segmentations.

Noise Injection

Introducing the noise injection block significantly improves the visual quality of cars entering the scene (the first two examples) and the details of parked cars (last example), for which the predicted poses are noisy.

Noised
Unnoised
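
For reference, the sketch below shows one simple form such a noise-injection step could take during appearance-synthesizer training: the skeleton conditioning is randomly corrupted so the synthesizer learns to tolerate imperfect predicted poses at inference. The latent layout and strength range are assumptions, not the exact recipe used in MAD.

```python
# Hypothetical noise injection on the skeleton conditioning during training.
import torch

def inject_noise(skeleton_latents, max_strength=0.3):
    """skeleton_latents: (B, T, C, h, w) encoded skeleton video."""
    # Per-example strength so the synthesizer sees both clean and corrupted poses.
    s = torch.rand(skeleton_latents.shape[0], 1, 1, 1, 1) * max_strength
    noise = torch.randn_like(skeleton_latents)
    return (1.0 - s) * skeleton_latents + s * noise
```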

Panoptic Segmentation

Panoptic segmentation loses information about the cars, so we observe cars moving backward in the first two examples. It is also sometimes hard to distinguish individual cars, leading to cars disappearing or merging into others in crowded scenes.

MAD-LTX
Segmentation Baseline

HDMap

Predicting in the HDMap space is harder for the model to learn. As a result, the videos below contain many implausible vehicle movements. It is also sometimes hard to synthesize high-quality agents from the 3D bounding boxes, especially pedestrians.

MAD-LTX
HDMap Baseline

References

[1] HaCohen, Y., et al. "LTX-Video: Realtime Video Latent Diffusion." (2025).
[2] Blattmann, A., et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." (2023).
[3] Hassan, M., et al. "GEM: A Generalizable Ego-Vision Multimodal World Model..." CVPR 2025.
[4] Gao, S., et al. "Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability." NeurIPS 2024.
[5] NVIDIA, et al. "Cosmos World Foundation Model Platform for Physical AI." (2025).