Multimodal

Beyond text — images, audio, and video.

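Diffusion models generate images, and increasingly video and audio, by learning to reverse a step-by-step noising process. During training, noise is gradually added to real data and the model learns to undo it. At generation time, the model starts from pure noise and denoises it step by step into a coherent sample.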
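The noising and denoising loop can be sketched in a few lines. This is a minimal toy under stated assumptions, not a real image model: the "data" is a single number, the linear schedule and the DDPM-style update are one common choice among several, and `oracle_noise` is a hypothetical stand-in for the trained network (it knows the exact answer because every data point equals `target`). The function names and parameter values are illustrative.

```python
import math
import random


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used choice).

    Returns the per-step noise amounts (betas) and the cumulative
    products of (1 - beta), which say how much signal survives at step t.
    """
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump to step t by mixing x0 with Gaussian noise."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def oracle_noise(x_t, t, alpha_bars, target):
    """Hypothetical stand-in for a trained noise predictor.

    Exact only in this toy, where every data point equals `target`.
    A real model would be a neural network trained to predict eps.
    """
    return (x_t - math.sqrt(alpha_bars[t]) * target) / math.sqrt(1.0 - alpha_bars[t])


def sample(target, betas, alpha_bars, rng):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.gauss(0.0, 1.0)  # pure noise
    for t in reversed(range(len(betas))):
        eps_hat = oracle_noise(x, t, alpha_bars, target)
        alpha = 1.0 - betas[t]
        # Remove the predicted noise, then rescale (DDPM-style mean).
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alpha)
        # Re-inject a little fresh noise on every step except the last.
        fresh = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    betas, alpha_bars = make_schedule()
    noisy, _ = add_noise(0.7, len(betas) - 1, alpha_bars, rng)
    print("fully noised value:", noisy)
    print("denoised sample:", sample(0.7, betas, alpha_bars, rng))
```

With the oracle in place, the sampler recovers the target value from random noise. In a real diffusion model, the oracle is replaced by a learned network and the scalar by an image-sized tensor (or a compressed latent), but the loop has the same shape.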

intermediate

Multimodal

A multimodal model processes more than one type of input: typically text plus images, and sometimes audio, video, or 3D data. GPT-4o, Claude, and Gemini are all multimodal.

beginner

Vision-Language Model

A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.

intermediate