User Input (Prompt)
All video generation begins with a prompt: the user describes the scene, the movement, the style, and the environment in text.
The AI must understand not only the objects in the scene, but also their physical properties, gravity, and how they should move according to the prompt.
"Camera flies slowly towards a glowing cyberpunk city, rain hitting the neon-lit streets. Epic atmosphere."
Spatial and Temporal Abstraction
Unlike image generators (which process flat 2D data), video generators must work with three-dimensional data (X, Y, and Time).
Your words are encoded into embedding vectors (by a text encoder such as CLIP), while the video itself is represented as a grid of so-called spatio-temporal "patches", small blocks spanning both space and time. Conditioning the patches on the text gives the AI a map of what is where at the beginning, where it is going, and when.
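The patch idea can be shown concretely: a latent video volume of shape (time, height, width, channels) is cut into small blocks spanning both space and time, and each block is flattened into one token vector. The patch sizes below are illustrative, not from any particular model:

```python
import numpy as np

# Toy latent video: (time, height, width, channels). Sizes are illustrative.
T, H, W, C = 8, 16, 16, 4
latent = np.random.default_rng(0).standard_normal((T, H, W, C))

def patchify(x, pt=2, ph=4, pw=4):
    """Cut the latent volume into spatio-temporal patches of shape
    (pt, ph, pw, C) and flatten each patch into one token vector."""
    T, H, W, C = x.shape
    x = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)     # gather patch dims together
    return x.reshape(-1, pt * ph * pw * C)   # (num_tokens, token_dim)

tokens = patchify(latent)
print(tokens.shape)  # (64, 128): 4*4*4 patches, each holding 2*4*4*4 values
```

Because every token covers a slice of time as well as space, the model reasons about motion the same way an image transformer reasons about layout.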
3D Denoising (Video Diffusion)
The AI generates the video from a massive volume of pure, flickering "TV noise", and it works on all frames of the clip simultaneously, not one after another!
By gradually removing noise, the AI uses a 3D U-Net or Transformer architecture to carve out coherent motion paths. You can watch figures begin to glide in the right direction as the noise clears.
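The denoising loop itself is simple in outline: start from a noise volume covering every frame, and repeatedly subtract the noise the network predicts. Here is a minimal, runnable sketch in which a trivial stand-in function replaces the trained 3D U-Net / Transformer:

```python
import numpy as np

def fake_denoiser(x, step):
    """Stand-in for a trained 3D U-Net or Transformer that predicts
    the noise remaining in x. Returning a damped copy keeps the
    loop runnable; a real model is a learned neural network."""
    return 0.1 * x

rng = np.random.default_rng(0)
# Start from pure noise over ALL frames at once: (time, H, W, channels).
x = rng.standard_normal((8, 16, 16, 4))
start_energy = float(np.abs(x).mean())

for step in range(50, 0, -1):
    predicted_noise = fake_denoiser(x, step)
    x = x - predicted_noise            # remove a little noise each step

print(x.shape)                          # still (8, 16, 16, 4)
print(float(np.abs(x).mean()) < start_energy)  # the volume grew cleaner
```

The key point the loop makes visible: the tensor keeps its full (time, height, width, channels) shape throughout, so every step refines the whole clip at once.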
Temporal Consistency
The biggest challenge in generating video is preventing things from shifting or morphing between frames. You've probably seen early AI videos where a character's coat or eyes constantly change!
At this stage, the model's attention blocks ensure that elements (such as lighting, reflections from the rain, and backgrounds) stay consistent with the preceding and succeeding frames, binding the whole clip into one vast "flow network".
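One common way to get this consistency is temporal self-attention: at each spatial position, every frame attends to every other frame, so a detail in frame 1 can directly influence frame 30. A minimal numpy sketch of that mechanism (simplified single-head attention, no learned projection weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x):
    """Self-attention along the time axis, applied independently at each
    spatial position: every frame can 'look at' every other frame, which
    is what keeps a character's coat the same from frame to frame.
    Learned query/key/value projections are omitted for brevity."""
    T, H, W, C = x.shape
    seq = x.reshape(T, H * W, C).transpose(1, 0, 2)    # (positions, time, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C) # frame-to-frame affinity
    out = softmax(scores, axis=-1) @ seq               # blend across frames
    return out.transpose(1, 0, 2).reshape(T, H, W, C)

x = np.random.default_rng(0).standard_normal((8, 4, 4, 16))
print(temporal_attention(x).shape)  # (8, 4, 4, 16): shape is preserved
```

Production models interleave this temporal attention with spatial attention (or fuse both into full 3D attention), but the frame-to-frame information flow shown here is the core of the consistency trick.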
Decoding (Spatio-Temporal VAE)
The AI's memory and processing power are not sufficient to handle millions of pixels directly across hundreds of frames. That's why all this magic has taken place in a small, compressed "latent" space.
In the case of video models, a separate spatio-temporal VAE (video autoencoder) is required to decode the compressed data back into high-resolution pixels, upsampling it spatially (width and height) and temporally (into the full sequence of video frames).
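The decoder's job is easiest to see through its shapes: a small latent volume comes in, a much larger pixel volume comes out. The toy version below uses nearest-neighbour repetition where a real decoder uses learned (often causal) convolutions; the 4x temporal and 8x spatial factors are illustrative:

```python
import numpy as np

def toy_video_decode(latent, sf_t=4, sf_s=8):
    """Toy stand-in for a spatio-temporal VAE decoder: upsample the latent
    by nearest-neighbour repetition in time (sf_t) and space (sf_s).
    A real decoder uses learned convolutions instead of plain repetition."""
    x = latent.repeat(sf_t, axis=0)     # stretch time: more frames
    x = x.repeat(sf_s, axis=1)          # stretch height
    x = x.repeat(sf_s, axis=2)          # stretch width
    return x[..., :3]                   # keep 3 channels as stand-in RGB

latent = np.random.default_rng(0).standard_normal((8, 16, 16, 4))
video = toy_video_decode(latent)
print(video.shape)  # (32, 128, 128, 3): 32 frames of 128x128 RGB
```

Note the compression ratio this implies: the network did all its denoising on a tensor hundreds of times smaller than the final pixel video, which is what makes the whole process tractable.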
Final Generated Video
After thousands of parallel neural network computations, the 4D information held in the model's latent states is decoded into ordinary RGB pixels, yielding a ~30 fps video for humans to view.
The most incredible thing about video AI is that it "simulates" a fully physical world entirely without traditional 3D programming, solely based on what it has learned.