Multimodal
Beyond text: images, audio, and video.

Diffusion Model (intermediate)
Diffusion models generate images (and now video, audio) by learning to reverse a step-by-step noising process. Starting from pure noise, they denoise back into a coherent sample.

Multimodal (beginner)
A multimodal model processes more than one type of input, typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Vision-Language Model (intermediate)
A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.
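
The step-by-step noising process mentioned in the Diffusion Model entry can be sketched in a few lines of plain Python. This is only an illustration of the forward (noising) half, under assumptions that are not part of the glossary: a linear noise schedule, a toy four-value "image", and hypothetical helper names (`linear_betas`, `cumulative_alphas`, `noise_sample`). A real diffusion model would additionally train a neural network to reverse these steps.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta): share of signal variance left after each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_sample(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)                    # seeded so runs are repeatable
alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [1.0, -1.0, 0.5, 0.0]                # toy "image" of four pixel values
x_early = noise_sample(x0, 10, alpha_bars, rng)   # still close to x0
x_late = noise_sample(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

With this schedule, almost none of the original signal survives by the last step. That is why generation can start from pure noise and denoise from there.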