State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow instructions with temporal logic. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples scene composition from motion synthesis by decomposing text-to-video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally correct anchor frame from this rewritten prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, devotes its entire capacity to animating the scene and following the prompt. Our decomposed approach sets a new state of the art on the T2V-CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, yielding a substantial speed-up in generation. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
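To make the three-stage factorization concrete, the sketch below wires the stages together in the order described above. It is a minimal illustration only: the interfaces `rewrite_fn`, `t2i_fn`, and `i2v_fn` are hypothetical placeholders standing in for the LLM, T2I model, and anchor-conditioned video model, and do not correspond to the actual APIs used in this work.

```python
# Minimal sketch of the Factorized Video Generation (FVG) pipeline.
# All model interfaces here are hypothetical placeholders, not the paper's APIs.
from dataclasses import dataclass
from typing import Any, Callable

Image = Any  # anchor frame produced by the Composition stage
Video = Any  # clip produced by the Temporal Synthesis stage


@dataclass
class FactorizedVideoGenerator:
    rewrite_fn: Callable[[str], str]       # Stage 1: LLM rewrites the prompt to the initial scene
    t2i_fn: Callable[[str], Image]         # Stage 2: T2I model renders the anchor frame
    i2v_fn: Callable[[Image, str], Video]  # Stage 3: video model animates the anchor

    def generate(self, prompt: str) -> Video:
        # (1) Reasoning: describe only the initial scene, resolving temporal ambiguities.
        initial_scene_prompt = self.rewrite_fn(prompt)
        # (2) Composition: synthesize a compositionally correct anchor frame.
        anchor_frame = self.t2i_fn(initial_scene_prompt)
        # (3) Temporal Synthesis: animate the anchor, conditioned on the original prompt;
        #     anchoring is what permits far fewer sampling steps than plain T2V.
        return self.i2v_fn(anchor_frame, prompt)
```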
Each example shows the same text prompt (caption) and the corresponding videos from a standard text-to-video (T2V) model and from our factorized model.
“A bee lands on a flower and a butterfly dances around the petals.”
“A pig cooks in a chef's hat and a chicken tastes the sauce.”
“A butterfly is flitting from one flower to the next, while a gardener is trimming the hedges carefully.”
“A parrot hosts a talk show and a duck claps.”
“A cat DJ at a party, a dog dances with glow sticks.”
“A frog plays drums on lily pads, a snail sways along with the beats.”
“A cat takes selfies while a dog rides a skateboard in the background.”
“A fish swims gracefully in a tank as a horse gallops outside.”
“A horse races alongside a moving train.”
“A timelapse of ice cream melting during a hot day.”
“A paper plane flying above a knitted vase.”
“A person sketches in a park and a chicken lays an egg.”
“A squirrel types on a laptop and a bat hangs from a charging cable.”
“A timelapse of a piece of metal gradually rusting when exposed to moisture.”
“A green rose gradually transitions to a shade of blue.”
“The icicle forms, going from a single droplet to a long, pointed shape as water freezes layer by layer.”
“The ink on the canvas changing from blue to green.”
Here we compare videos generated by a text-upsampling variant of the T2V model with those generated by our factorized model.
“A cat DJ at a party, a dog dances with glow sticks.”
“A cat takes selfies while a dog rides a skateboard in the background.”
“A frog plays drums on lily pads, a snail sways along with the beats.”
“Five fish swim gracefully in the sea.”
“Five trucks move along the road.”
“Rabbit police officer directs traffic.”
“The flower is wilting and losing its vibrant color.”
“The ink on the canvas changing from blue to green.”