✨ Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan*, Bastien Van Delft*, Wuyang Li, Alexandre Alahi
École Polytechnique Fédérale de Lausanne (EPFL)
* Equal contributors

Abstract

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state of the art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial sampling speed-up. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
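The three stages described in the abstract can be sketched as a small orchestration function. This is only an illustrative outline, not the authors' implementation: the names `factorized_generate`, `reduced_steps`, `FactorizedResult`, and the three callables are hypothetical placeholders for whatever LLM, T2I model, and finetuned video model are actually used. The sketch also assumes that the original video prompt is passed to the temporal synthesis stage, and the 50-step baseline in the usage example is an arbitrary assumption, not a figure from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stage signatures (placeholders, not a real library API).
RewriteFn = Callable[[str], str]            # stage 1: video prompt -> initial-scene prompt
ComposeFn = Callable[[str], Any]            # stage 2: initial-scene prompt -> anchor frame
AnimateFn = Callable[[Any, str, int], Any]  # stage 3: (anchor, video prompt, steps) -> video


@dataclass
class FactorizedResult:
    first_frame_prompt: str
    anchor_frame: Any
    video: Any


def reduced_steps(baseline_steps: int, reduction: float = 0.70) -> int:
    """Sampling steps remaining after removing a fraction `reduction` of the baseline."""
    return max(1, round(baseline_steps * (1.0 - reduction)))


def factorized_generate(
    video_prompt: str,
    rewrite: RewriteFn,
    compose: ComposeFn,
    animate: AnimateFn,
    num_steps: int,
) -> FactorizedResult:
    # (1) Reasoning: describe only the initial scene of the video.
    first_frame_prompt = rewrite(video_prompt)
    # (2) Composition: synthesize the anchor frame from the rewritten prompt.
    anchor_frame = compose(first_frame_prompt)
    # (3) Temporal synthesis: animate the anchor, conditioned on the video prompt
    #     (assumption: the original prompt, not the rewritten one, is used here).
    video = animate(anchor_frame, video_prompt, num_steps)
    return FactorizedResult(first_frame_prompt, anchor_frame, video)


if __name__ == "__main__":
    # Stub stages standing in for real models, for illustration only.
    result = factorized_generate(
        "A bee lands on a flower and a butterfly dances around the petals.",
        rewrite=lambda p: "Initial scene of: " + p,
        compose=lambda p: {"image_for": p},
        animate=lambda anchor, p, n: {"anchor": anchor, "prompt": p, "steps": n},
        num_steps=reduced_steps(50),  # assumed 50-step baseline
    )
    print(result.video["steps"])
```

Keeping each stage behind a plain callable makes it easy to swap the LLM, the T2I model, or the video backbone independently, which mirrors the decoupling the abstract argues for.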


Text-to-Video vs Factorized Video Generation

Each example shows the same text prompt (caption) and the corresponding videos from a standard text-to-video model (T2V) and our factorized model.

“A bee lands on a flower and a butterfly dances around the petals.”

Wan2.2 T2V

Wan2.2 Factorized

“A pig cooks in a chef's hat and a chicken tastes the sauce.”

Wan2.2 T2V

Wan2.2 Factorized

“A butterfly is flitting from one flower to the next, while a gardener is trimming the hedges carefully.”

Wan2.2 T2V

Wan2.2 Factorized

“A parrot hosts a talk show and a duck claps.”

Wan2.2 T2V

Wan2.2 Factorized

“A cat DJ at a party, a dog dances with glow sticks.”

Wan2.2 T2V

Wan2.2 Factorized

“A frog plays drums on lily pads, a snail sways along with the beats.”

Wan2.2 T2V

Wan2.2 Factorized

“A cat takes selfies while a dog rides a skateboard in the background.”

Wan2.2 T2V

Wan2.2 Factorized

“A fish swims gracefully in a tank as a horse gallops outside.”

Wan2.2 T2V

Wan2.2 Factorized

“A horse races alongside a moving train.”

Wan2.2 T2V

Wan2.2 Factorized

“A timelapse of ice cream melting during a hot day.”

Wan2.2 T2V

Wan2.2 Factorized

“A paper plane flying above a knitted vase.”

Wan2.2 T2V

Wan2.2 Factorized

“A person sketches in a park and a chicken lays an egg.”

Wan2.2 T2V

Wan2.2 Factorized

“A squirrel types on a laptop and a bat hangs from a charging cable.”

Wan2.2 T2V

Wan2.2 Factorized

“A timelapse of a piece of metal gradually rusting when exposed to moisture.”

Wan2.2 T2V

Wan2.2 Factorized

“A green rose gradually transitions to a shade of blue.”

Wan2.2 T2V

Wan2.2 Factorized

“The icicle forms, going from a single droplet to a long, pointed shape as water freezes layer by layer.”

Wan2.2 T2V

Wan2.2 Factorized

“The ink on the canvas changing from blue to green.”

Wan2.2 T2V

Wan2.2 Factorized

Text Upsampling vs Factorized Video Generation

Here we compare videos generated with a text-upsampling variant of the T2V model to those from our factorized model.

“A cat DJ at a party, a dog dances with glow sticks.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“A cat takes selfies while a dog rides a skateboard in the background.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“A frog plays drums on lily pads, a snail sways along with the beats.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“Five fish swim gracefully in the sea.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“Five trucks move along the road.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“Rabbit police officer directs traffic.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“The flower is wilting and losing its vibrant color.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized

“The ink on the canvas changing from blue to green.”

Text Upsampled Wan2.2 T2V

Wan2.2 Factorized