User Input (Prompt)
Everything starts with text. The text prompt is the gateway to the AI's imagination. The user describes the desired image as accurately as possible, and the model uses this sentence as a guideline throughout the process.
In this experiment, we task the AI with creating a retrofuturistic landscape.
Text Embedding (CLIP Text Encoder)
The AI cannot work with words directly. First, the text you provide is run through a text encoder, often the one from CLIP, a model trained to learn the connections between images and text.
It breaks the text down into tokens and translates each one into a vector: a list of hundreds of numbers, one per dimension. This is how the computer builds a purely mathematical representation of what "neon", "space", or "cyberpunk" means visually.
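To make the idea of "meaning as numbers" concrete, here is a minimal sketch with hand-made vectors. It is not CLIP: the four-dimensional values, the extra word "meadow", and the helper names are all hypothetical and chosen only for illustration, whereas a real encoder learns its vectors from training data.

```python
import math

# Hypothetical toy vectors (4 dimensions). A real text encoder learns
# vectors with hundreds of dimensions; these values are invented.
TOY_EMBEDDINGS = {
    "neon":      [0.9, 0.8, 0.1, 0.0],
    "cyberpunk": [0.8, 0.9, 0.2, 0.1],
    "space":     [0.1, 0.3, 0.9, 0.2],
    "meadow":    [0.0, 0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed_prompt(prompt):
    """Split a prompt on whitespace and look up the vector of each known word."""
    words = prompt.lower().split()
    return [TOY_EMBEDDINGS[w] for w in words if w in TOY_EMBEDDINGS]

vectors = embed_prompt("neon space city")
print(len(vectors))  # "city" is not in the toy table, so two vectors remain

near = cosine_similarity(TOY_EMBEDDINGS["neon"], TOY_EMBEDDINGS["cyberpunk"])
far = cosine_similarity(TOY_EMBEDDINGS["neon"], TOY_EMBEDDINGS["meadow"])
print(near > far)  # related concepts point in similar directions
```

In this toy table, "neon" and "cyberpunk" were deliberately given similar numbers, so their similarity score is high. In a trained encoder, that closeness is learned rather than written by hand.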
Denoising (Latent Space & U-Net)
This is the core of the diffusion model. Generation begins with pure random noise, generated from a seed value. The AI model doesn't paint the picture from scratch; it carves it out of the noise.
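The role of the seed can be sketched in a few lines. This is a simplified illustration using Python's standard random number generator, not the generator any particular image model uses; the function name and the tiny 4x4 grid are assumptions made for the example.

```python
import random

def make_noise(seed, height=4, width=4):
    """Return a height x width grid of Gaussian noise determined entirely by the seed."""
    rng = random.Random(seed)  # a private generator, so the result depends only on the seed
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

same_a = make_noise(42)
same_b = make_noise(42)
other = make_noise(43)

print(same_a == same_b)  # identical seed, identical starting noise
print(same_a == other)   # a different seed gives a different starting point
```

This is why reusing a seed with the same prompt and settings can reproduce an image, while changing the seed gives a new variation.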
The U-Net neural network processes the noisy image many times (steps), removing a little noise on each pass. It uses the text vectors created by CLIP as its "map" so that shapes matching your original text start to emerge from underneath the noise.
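The step-by-step removal of noise can be sketched as a loop. This is a deliberately simplified toy, not a real sampler: the `target` list stands in for whatever the text guidance is steering towards, and the "noise prediction" is computed directly from it, whereas a real U-Net has to estimate the noise from training. The function name, `step_size`, and default values are assumptions for illustration.

```python
import random

def toy_denoise(target, steps=30, seed=0, step_size=0.2):
    """Start from random noise and remove a fraction of the predicted noise each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with

    for _ in range(steps):
        # Stand-in for the U-Net: here the "predicted noise" is simply the
        # difference from the target. A real model must learn this estimate.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove only part of the predicted noise, so the result emerges gradually.
        x = [xi - step_size * ni for xi, ni in zip(x, predicted_noise)]
    return x

target = [0.5, -1.0, 2.0]
rough = toy_denoise(target, steps=1)
fine = toy_denoise(target, steps=30)

def total_error(values):
    return sum(abs(v - t) for v, t in zip(values, target))

print(total_error(rough) > total_error(fine))  # more steps leave less noise
```

Each pass shrinks the remaining error by a fixed fraction here, which is why a single step leaves a rough result and many steps converge on the target.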
Decoding (VAE)
Until now, all denoising has taken place in the AI's "hidden" working space, the latent space. This is a highly compressed representation that cannot be viewed directly as an image, and working in it lets the model process the image much faster and more efficiently.
Now that the raw core of the image is ready in its latent state, the Variational Autoencoder (VAE) comes into play. Its decoder takes that compressed data and scales it up into a large, full-resolution image, converting the numbers into the actual RGB pixels that we see as the final artwork.
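A little arithmetic shows how much smaller the latent representation is. The sketch below assumes the figures commonly cited for Stable Diffusion v1 (each side reduced by a factor of 8, with 4 latent channels); the text above does not name a specific model, so treat these defaults and the function names as assumptions.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, downsample=8, latent_channels=4, rgb_channels=3):
    """Return how many times more numbers the RGB image holds than its latent."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (height * width * rgb_channels) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, every denoising step works on 48 times fewer numbers than it would on the full RGB image, and the decoder expands the result only once at the end.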
Final Output
After dozens or even hundreds of steps, the noise has been refined into a completely new, unique image.
The AI did not copy this artwork from the internet or piece it together from existing photographs; it generated the image itself, step by step, according to its algorithm and your instructions.