We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting.
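For concreteness, the sketch below outlines this two-stage inference pipeline. The function and argument names are illustrative placeholders, not our actual implementation; the two stages are assumed to be callables wrapping the same adapted diffusion backbone with different conditioning.

```python
def generate_driving_video(first_frame, motion_forecaster, appearance_synthesizer,
                           controls=None):
    """Hypothetical two-stage sampling loop (names are illustrative)."""
    # Stage 1: predict structured motion (skeletonized agents and scene elements)
    # in a simplified space, optionally steered by control signals.
    motion_video = motion_forecaster(first_frame, controls=controls)

    # Stage 2: reuse the backbone to "dress" the motion with texture and lighting,
    # producing the final RGB video aligned with the predicted motion.
    rgb_video = appearance_synthesizer(first_frame, motion_video)
    return motion_video, rgb_video
```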
Our experiments show this decoupled approach is exceptionally efficient: adapting SVD [2], we match prior SOTA models with less than 6% of their compute. Scaling to LTX [1], our MAD-LTX model outperforms all open-source competitors.
Our Appearance Synthesizer effectively "dresses" the sparse predicted motion signals with realistic textures and lighting. You can verify the alignment between predicted motion (skeleton) and synthesized appearance (RGB) by moving the vertical slider in each video.
A key capability of our Motion Forecaster is the synthesis of agents entering the scene that were not present in the initial frame. The model predicts plausible trajectories for new vehicles and pedestrians, which are then seamlessly rendered by the Appearance Synthesizer, enriching the scene's dynamism.
While our pose predictor is trained specifically on cars and pedestrians, it exhibits an implicit understanding of unmodeled occluders, such as trucks. As shown below, the model avoids predicting skeletons where a truck is present (Video 1) or correctly handles the occlusion when a truck enters the scene (Video 2), ensuring that the Appearance Synthesizer does not render conflicting geometry in those regions.
Pedestrians often present a challenge for driving world models due to their non-rigid motion. By conditioning on explicit skeleton sequences, MAD-LTX generates pedestrians with high temporal consistency and realistic gait.
The generations of MAD-LTX are compared to other baselines, all given the same starting frame and caption (for models that accept captions). We show comparisons at two scales: 13B and 2B parameters. Note that MAD-LTX is trained with significantly less compute than the baselines. The first three examples show cases where our model's generations are clearly better. We also improve pedestrian motion over prior work, although there is still room for improvement compared to Cosmos Predict 2 [5].
Comparisons of MAD-SVD against VISTA [4] and GEM [3], all derived from the same Stable Video Diffusion [2] backbone. We achieve the same video generation quality using only a fraction of their training requirements, while the use of poses generally improves the visual and motion quality of the generated pedestrians and cyclists.
Our model accepts three types of control signals: ego-movement control, which determines how the ego vehicle moves through the scene; object-movement control, which lets us steer objects of interest in the environment; and text captions.
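The sketch below illustrates one way these three control signals could be packaged for the model; the field names and tensor shapes are assumptions for illustration, not our exact interface.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class WorldModelControls:
    """Illustrative container for the three control signals
    (field names and shapes are assumed, not the actual interface)."""
    ego_motion: Optional[torch.Tensor] = None    # e.g. (T, 6) per-frame ego pose deltas
    object_boxes: Optional[torch.Tensor] = None  # e.g. (T, N, 4) boxes for tracked objects
    caption: Optional[str] = None                # free-form text prompt

def build_controls(ego_motion=None, object_boxes=None, caption=None):
    # Any subset of controls may be provided; a missing signal simply falls
    # back to unconditional behaviour for that aspect of the generation.
    return WorldModelControls(ego_motion=ego_motion,
                              object_boxes=object_boxes,
                              caption=caption)
```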
Our motion forecaster accepts an ego-motion representation as additional input, conditioning its generation on the desired ego movement. We extract this representation from the ground-truth video, which you can inspect by moving the vertical line. We then condition the generation of MAD-LTX 13B on this motion representation to obtain the 'conditional' results. The 'unconditional' column shows videos generated by the same 13B model without access to the ego movement. As you can see, the conditional generations follow the ego movement closely, while the unconditional generations exhibit different ego movements.
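As an illustration of what such an ego-motion representation might look like, the snippet below computes per-frame relative transforms from a ground-truth pose trajectory; the exact representation used by our model may differ.

```python
import numpy as np

def ego_motion_from_poses(world_T_ego):
    """Hypothetical ego-motion representation: relative transforms between
    consecutive ego poses (4x4 matrices), extracted from the ground-truth
    trajectory and fed to the motion forecaster as conditioning."""
    world_T_ego = np.asarray(world_T_ego)          # (T, 4, 4)
    rel = []
    for t in range(1, len(world_T_ego)):
        prev_T_world = np.linalg.inv(world_T_ego[t - 1])
        rel.append(prev_T_world @ world_T_ego[t])  # pose at t expressed in frame t-1
    return np.stack(rel)                           # (T-1, 4, 4)
```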
We use bounding boxes to control the movement of other objects in the scene. Below you can see how we can make cars turn or move in the scene using our bounding boxes, while the unconditional model shows a different behaviour for the same car. The conditional videos show the control bounding boxes (green) overlaid on the generation. As you can see, our model follows the bounding-box controls closely, making cars turn left or keeping them at the desired position on the road.
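A minimal sketch of how per-frame control boxes could be rasterised into conditioning masks is shown below; the box format and mask encoding are assumptions for illustration, not our exact pipeline.

```python
import numpy as np

def rasterize_box_controls(boxes, height, width):
    """Hypothetical rasterisation of per-frame control boxes into binary masks
    that can be concatenated with the model's conditioning input.

    boxes: (T, N, 4) array of (x1, y1, x2, y2) in pixel coordinates;
           NaN rows mark objects absent in that frame.
    """
    T, N, _ = boxes.shape
    masks = np.zeros((T, height, width), dtype=np.float32)
    for t in range(T):
        for n in range(N):
            x1, y1, x2, y2 = boxes[t, n]
            if np.isnan(x1):
                continue
            x1, y1 = max(int(x1), 0), max(int(y1), 0)
            x2, y2 = min(int(x2), width), min(int(y2), height)
            masks[t, y1:y2, x1:x2] = 1.0   # mark the region the object should occupy
    return masks
```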
MAD-LTX retains the text-prompt functionality of the base LTX model [1] and adapts it to driving scenarios. Below you can see examples of how our model follows these prompts in its generations.
We conduct three ablation studies to support our main design choices. First, we show the effect of noise injection in appearance-synthesizer training. Then, we compare against two alternative intermediate representations: HD maps and panoptic segmentations.
Introducing the noise injection block significantly improves the visual quality of cars entering the scene (first two examples) and the details of parked cars (last example), for which the predicted poses are noisy.
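The snippet below is a minimal sketch of what such noise injection could look like during appearance-synthesizer training, assuming the pose conditioning is a batch of image-like frames; the exact noise schedule and injection point in our model may differ.

```python
import torch

def noisy_pose_conditioning(pose_frames, sigma_max=0.2):
    """Illustrative noise injection (assumed form): perturb the pose/skeleton
    conditioning during training so the appearance synthesizer learns to be
    robust to imperfect motion predictions at inference time.

    pose_frames: (B, C, H, W) tensor of rendered pose/skeleton frames.
    """
    # Sample a per-example noise level in [0, sigma_max) and add Gaussian noise.
    sigma = torch.rand(pose_frames.shape[0], 1, 1, 1,
                       device=pose_frames.device) * sigma_max
    return pose_frames + sigma * torch.randn_like(pose_frames)
```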
Panoptic segmentation loses information about the cars, so we observe cars moving backward in the first two examples. It is also sometimes hard to distinguish individual cars, leading to cars disappearing or merging into others in crowded scenarios.
Predicting in the HD map space is harder for the model to learn. As a result, the videos below show many implausible vehicle movements. It is also sometimes difficult to synthesize high-quality agents from the 3D bounding boxes, especially pedestrians.