We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively “dressing” the motion with texture and lighting.
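For concreteness, the sketch below outlines this two-stage inference pipeline. The function and argument names are illustrative placeholders, not our actual implementation; the two stages are assumed to be callables wrapping the same adapted diffusion backbone with different conditioning.

```python
def generate_driving_video(first_frame, motion_forecaster, appearance_synthesizer,
                           controls=None):
    """Hypothetical two-stage sampling loop (names are illustrative)."""
    # Stage 1: predict structured motion (skeletonized agents and scene elements)
    # in a simplified space, optionally steered by control signals.
    motion_video = motion_forecaster(first_frame, controls=controls)

    # Stage 2: reuse the backbone to "dress" the motion with texture and lighting,
    # producing the final RGB video aligned with the predicted motion.
    rgb_video = appearance_synthesizer(first_frame, motion_video)
    return motion_video, rgb_video
```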
Our experiments show this decoupled approach is exceptionally efficient: adapting SVD [2], we match prior SOTA models with less than 6% of their compute. Scaling to LTX [1], our MAD-LTX model outperforms all open-source competitors.
Our Appearance Synthesizer effectively "dresses" the sparse predicted motion signals with realistic textures and lighting. You can verify the alignment between predicted motion (skeleton) and synthesized appearance (RGB) by moving the vertical slider in each video.
A key capability of our Motion Forecaster is the synthesis of agents entering the scene that were not present in the initial frame. The model predicts plausible trajectories for new vehicles and pedestrians, which are then seamlessly rendered by the Appearance Synthesizer, enriching the scene's dynamism.
While our pose predictor is trained specifically on cars and pedestrians, it exhibits an implicit understanding of unmodeled occluders, such as trucks. As shown below, the model avoids predicting skeletons where a truck is present (Video 1) or correctly handles the occlusion when a truck enters the scene (Video 2), ensuring that the Appearance Synthesizer does not render conflicting geometry in those regions.
Pedestrians often present a challenge for driving world models due to their non-rigid motion. By conditioning on explicit skeleton sequences, MAD-LTX generates pedestrians with high temporal consistency and realistic gait.
The generations of MAD-LTX are compared to other baselines, all given the same starting frame and caption (for models that accept captions). We show comparisons at two scales: 13B and 2B parameters. Note that MAD-LTX is trained with significantly less compute than the baselines. The first three examples show cases where our model's generations are clearly better. We also improve pedestrian motion over prior work, although there is still room for improvement compared to Cosmos Predict 2 [5].
Comparisons of MAD-SVD against VISTA [4] and GEM [3], all derived from the same Stable Video Diffusion [2] backbone. We achieve the same video generation quality using only a fraction of their training requirements, while the use of poses generally improves the visual and motion quality of the generated pedestrians and cyclists.
Our model accepts three types of control signals: ego-movement control, which determines how the ego vehicle moves through the scene; object-movement control, which lets us steer objects of interest in the environment; and text captions.
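The sketch below illustrates one way these three control signals could be packaged for the model; the field names and tensor shapes are assumptions for illustration, not our exact interface.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class WorldModelControls:
    """Illustrative container for the three control signals
    (field names and shapes are assumed, not the actual interface)."""
    ego_motion: Optional[torch.Tensor] = None    # e.g. (T, 6) per-frame ego pose deltas
    object_boxes: Optional[torch.Tensor] = None  # e.g. (T, N, 4) boxes for tracked objects
    caption: Optional[str] = None                # free-form text prompt

def build_controls(ego_motion=None, object_boxes=None, caption=None):
    # Any subset of controls may be provided; a missing signal simply falls
    # back to unconditional behaviour for that aspect of the generation.
    return WorldModelControls(ego_motion=ego_motion,
                              object_boxes=object_boxes,
                              caption=caption)
```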
Our motion forecaster accepts an ego-motion representation as additional input, conditioning its generation on the desired ego movement. We extract this representation from the ground-truth video, which you can inspect by moving the vertical line. We then condition the generation of MAD-LTX 13B on this motion representation to obtain the 'conditional' results. The 'unconditional' column shows videos generated by the same 13B model without access to the ego movement. As you can see, the conditional generations follow the ego movement closely, while the unconditional generations exhibit different ego movements.
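As an illustration of what such an ego-motion representation might look like, the snippet below computes per-frame relative transforms from a ground-truth pose trajectory; the exact representation used by our model may differ.

```python
import numpy as np

def ego_motion_from_poses(world_T_ego):
    """Hypothetical ego-motion representation: relative transforms between
    consecutive ego poses (4x4 matrices), extracted from the ground-truth
    trajectory and fed to the motion forecaster as conditioning."""
    world_T_ego = np.asarray(world_T_ego)          # (T, 4, 4)
    rel = []
    for t in range(1, len(world_T_ego)):
        prev_T_world = np.linalg.inv(world_T_ego[t - 1])
        rel.append(prev_T_world @ world_T_ego[t])  # pose at t expressed in frame t-1
    return np.stack(rel)                           # (T-1, 4, 4)
```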
We use bounding boxes to control the movement of other objects in the scene. Below you can see how we can make cars turn or move in the scene using our bounding boxes, while the unconditional model shows a different behaviour for the same car. The conditional videos show the control bounding boxes (green) overlaid on the generation. As you can see, our model follows the bounding-box controls closely, making cars turn left or keeping them at the desired position on the road.
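A minimal sketch of how per-frame control boxes could be rasterised into conditioning masks is shown below; the box format and mask encoding are assumptions for illustration, not our exact pipeline.

```python
import numpy as np

def rasterize_box_controls(boxes, height, width):
    """Hypothetical rasterisation of per-frame control boxes into binary masks
    that can be concatenated with the model's conditioning input.

    boxes: (T, N, 4) array of (x1, y1, x2, y2) in pixel coordinates;
           NaN rows mark objects absent in that frame.
    """
    T, N, _ = boxes.shape
    masks = np.zeros((T, height, width), dtype=np.float32)
    for t in range(T):
        for n in range(N):
            x1, y1, x2, y2 = boxes[t, n]
            if np.isnan(x1):
                continue
            x1, y1 = max(int(x1), 0), max(int(y1), 0)
            x2, y2 = min(int(x2), width), min(int(y2), height)
            masks[t, y1:y2, x1:x2] = 1.0   # mark the region the object should occupy
    return masks
```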
MAD-LTX retains the text-prompt functionality of the base LTX model [1] and adapts it to driving scenarios. Below you can see examples of how our model follows these prompts in its generations.
We conduct three ablation studies to support our main design choices. First, we show the effect of noise injection in appearance-synthesizer training. Then, we compare against two alternative intermediate representations: HD maps and panoptic segmentations.
Introducing the noise injection block significantly improves the visual quality of cars entering the scene (first two examples) and the details of parked cars (last example), for which the predicted poses are noisy.
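The snippet below is a minimal sketch of what such noise injection could look like during appearance-synthesizer training, assuming the pose conditioning is a batch of image-like frames; the exact noise schedule and injection point in our model may differ.

```python
import torch

def noisy_pose_conditioning(pose_frames, sigma_max=0.2):
    """Illustrative noise injection (assumed form): perturb the pose/skeleton
    conditioning during training so the appearance synthesizer learns to be
    robust to imperfect motion predictions at inference time.

    pose_frames: (B, C, H, W) tensor of rendered pose/skeleton frames.
    """
    # Sample a per-example noise level in [0, sigma_max) and add Gaussian noise.
    sigma = torch.rand(pose_frames.shape[0], 1, 1, 1,
                       device=pose_frames.device) * sigma_max
    return pose_frames + sigma * torch.randn_like(pose_frames)
```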
Panoptic segmentation loses information about the cars, so we observe cars moving backward in the first two examples. It is also sometimes hard to distinguish individual cars, leading to cars disappearing or merging into others in crowded scenarios.
Predicting in the HD map space is harder for the model to learn. As a result, the videos below show many implausible vehicle movements. It is also sometimes difficult to synthesize high-quality agents from the 3D bounding boxes, especially pedestrians.