Multimodal · intermediate

Diffusion Model (diffusion)

Diffusion models generate images (and now video and audio) by learning to reverse a step-by-step noising process. At generation time they start from pure noise and denoise it, step by step, into a coherent sample.

Published May 29, 2026

Explanation

Training: take a clean image, add Gaussian noise to it according to a fixed schedule that spans many steps, and train a network to predict the noise that was added at a given step. Inference: start from pure noise and apply the trained denoiser repeatedly, moving from the noisiest step back to a clean sample, optionally conditioned on a text prompt.
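The training and inference loops described above can be sketched in a few lines of NumPy. This is a minimal illustration in the style of the DDPM paper listed under Sources, not a production implementation. The function names (make_schedule, add_noise, training_loss, sample) and the "oracle" denoiser are assumptions made up for this example. A real system would replace the oracle with a trained neural network and would usually work on batches of images.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the default values follow the DDPM paper."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # fraction of signal variance left at each step
    return betas, alphas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump straight to step t in closed form."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def training_loss(predict_noise, x0, alpha_bars, rng):
    """One training example: random step, noised image, squared error on the noise."""
    t = int(rng.integers(0, len(alpha_bars)))
    xt, noise = add_noise(x0, t, alpha_bars, rng)
    return float(np.mean((predict_noise(xt, t) - noise) ** 2))

def sample(predict_noise, shape, betas, alphas, alpha_bars, rng):
    """Inference: start from pure noise and denoise one step at a time."""
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a small amount of fresh noise.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

# Toy demo (assumption): the "dataset" is a single 4x4 image, so a perfect
# denoiser can be written by hand instead of being trained.
rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))

def oracle(xt, t):
    """Hypothetical stand-in for a trained network: recovers the added noise exactly."""
    return (xt - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

loss = training_loss(oracle, x0, alpha_bars, rng)
generated = sample(oracle, x0.shape, betas, alphas, alpha_bars, rng)
print("loss:", loss)
print("max abs difference from x0:", np.abs(generated - x0).max())
```

Because the oracle knows the only image in this toy dataset, the loss is essentially zero and sampling lands back on that image. A network trained on many images instead learns a noise predictor that generalizes, which is why different starting noise yields different samples. Note that training does not walk through every step in order: it picks a random step and noises the image to that level in one jump.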

Stable Diffusion, DALL-E (from version 2 onward), Midjourney, FLUX, and most image-generation services use diffusion or a close relative such as flow matching. Recent work has scaled diffusion to video (Sora, Veo) and audio (Suno, Udio).

Examples

Stable Diffusion generating an image from "a photo of an astronaut on a horse."
Sora generating short video clips from a text description.

Frequently asked

How is a diffusion model related to generative AI?

A diffusion model is one kind of generative AI. Generative AI refers to models that produce new content (text, images, audio, video, or code) rather than classifying or predicting from a fixed set of labels; diffusion is the approach behind most current image generators.

Are diffusion models considered an intermediate topic?

Yes. Diffusion models are generally considered intermediate-level material in the AI and LLM space.

Generative AI · Foundations

Generative AI refers to models that produce new content — text, images, audio, video, or code — rather than classifying or predicting from a fixed set of labels.

Multimodal · Multimodal

A multimodal model processes more than one type of input — typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Sources

Denoising Diffusion Probabilistic Models (arXiv)