ModelTerms

Multimodal · intermediate

Diffusion Model (diffusion)

Diffusion models generate images, and increasingly video and audio, by learning to reverse a step-by-step noising process. At generation time they start from pure noise and denoise it, step by step, into a coherent sample.

Explanation

Training: take a clean image, add Gaussian noise at a randomly chosen step of a many-step noising schedule, and train a network to predict the noise that was added. Inference: start from pure noise and apply the denoiser repeatedly, stepping backward through the schedule from the noisiest step to the cleanest, optionally conditioned on a text prompt.
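The training and inference loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular product's implementation: it assumes a DDPM-style linear noise schedule, works on short lists of floats rather than images, and every name in it (make_schedule, add_noise, training_loss, sample, make_oracle) is hypothetical. The predict_noise argument stands in for the trained network; here an "oracle" that already knows the single target data point plays that role so the sketch runs without any training.

```python
import math
import random


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule plus the cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: noise a clean sample to step t in closed form."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def training_loss(predict_noise, x0, alpha_bars, rng):
    """One training example: mean squared error between true and predicted noise."""
    t = rng.randrange(len(alpha_bars))
    xt, eps = add_noise(x0, t, alpha_bars, rng)
    eps_hat = predict_noise(xt, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)


def sample(predict_noise, dim, betas, alpha_bars, rng):
    """Reverse process: start from pure noise and denoise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        alpha = 1.0 - betas[t]
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alpha) for xi, e in zip(x, eps_hat)]
        if t > 0:
            # Re-inject a little noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


def make_oracle(x0, alpha_bars):
    """Stand-in for a trained network: exact noise when all data equals x0."""
    def predict_noise(xt, t):
        a = alpha_bars[t]
        return [(v - math.sqrt(a) * x) / math.sqrt(1.0 - a)
                for v, x in zip(xt, x0)]
    return predict_noise


if __name__ == "__main__":
    rng = random.Random(0)
    betas, alpha_bars = make_schedule()
    oracle = make_oracle([0.5, -1.0], alpha_bars)
    print(sample(oracle, 2, betas, alpha_bars, rng))
```

With the oracle in place, sampling should land on values very close to the target point [0.5, -1.0]. A real system would replace the oracle with a neural network trained by minimizing training_loss over many images, and would pass the text prompt to that network as an extra input.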

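Stable Diffusion, DALL-E 2 and 3, and FLUX are built on diffusion or a close relative such as flow matching. Midjourney has not published its architecture but is widely assumed to be diffusion-based, as are most other image-generation services. Recent work has scaled the approach to video (Sora, Veo) and to audio, where services such as Suno and Udio are reported to use it, although their architectures are not publicly documented.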

Examples

  • Stable Diffusion generating an image from "a photo of an astronaut on a horse."
  • Sora generating short video clips from a text description.

Frequently asked

What is a diffusion model?

A diffusion model generates images, and increasingly video and audio, by learning to reverse a step-by-step noising process. At generation time it starts from pure noise and denoises it into a coherent sample.

What is an example of a diffusion model?

Stable Diffusion generating an image from "a photo of an astronaut on a horse."

How are diffusion models related to generative AI?

Diffusion models are one family of generative AI models. Generative AI refers to models that produce new content (text, images, audio, video, or code) rather than classifying or predicting from a fixed set of labels. Diffusion models do this by turning noise into a sample.

Are diffusion models considered intermediate?

Yes. Diffusion models are generally considered intermediate-level material in the AI and LLM space.
