Multimodal · beginner

Multimodal (multi-modal)

A multimodal model processes more than one type of input: typically text plus images, and sometimes audio, video, or 3D data as well. GPT-4o, Claude, and Gemini are all multimodal.

Published May 29, 2026

Explanation

In a multimodal LLM, non-text inputs are converted into vectors in the same embedding space as text tokens (for images, usually by a vision encoder), and the combined sequence is processed by a single transformer. This lets the model answer questions about pictures, transcribe audio, or generate text from video frames.
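The idea of a shared embedding space can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular model is implemented: the dimensions, the vocabulary, the random weights, and the helper names (`encode_image`, `project`, `build_input_sequence`) are all hypothetical, and the "vision encoder" is a stand-in that computes simple statistics per patch.

```python
import random

random.seed(0)

EMBED_DIM = 8   # hypothetical width of the LLM's embedding space
VISION_DIM = 4  # hypothetical width of the vision encoder's output


def random_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy vocabulary and text embedding table (one vector per token).
VOCAB = {"what": 0, "is": 1, "in": 2, "this": 3, "image": 4, "?": 5}
text_embedding_table = random_matrix(len(VOCAB), EMBED_DIM)

# Toy projection layer: maps vision features into the text embedding space.
projection = random_matrix(VISION_DIM, EMBED_DIM)


def embed_text(tokens):
    """Look up one embedding vector per text token."""
    return [text_embedding_table[VOCAB[t]] for t in tokens]


def encode_image(patches):
    """Stand-in vision encoder: one VISION_DIM feature vector per patch."""
    features = []
    for patch in patches:
        mean = sum(patch) / len(patch)
        features.append([mean, min(patch), max(patch), max(patch) - min(patch)])
    return features


def project(features):
    """Multiply each feature vector by the projection matrix."""
    return [
        [sum(f[i] * projection[i][j] for i in range(VISION_DIM))
         for j in range(EMBED_DIM)]
        for f in features
    ]


def build_input_sequence(patches, tokens):
    """Image vectors and text vectors end up in one sequence of equal width."""
    return project(encode_image(patches)) + embed_text(tokens)


patches = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]]  # two fake image patches
tokens = ["what", "is", "in", "this", "image", "?"]
sequence = build_input_sequence(patches, tokens)
print(len(sequence), "vectors of width", len(sequence[0]))
```

The two image patches and six text tokens become a single sequence of eight vectors, each of width `EMBED_DIM`. In a real model, a sequence like this is what the transformer would receive; from that point on it treats image and text positions the same way.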

Natively multimodal models are trained from scratch on mixed-modality data, while adapted multimodal models attach a vision encoder to an existing LLM through a projection layer. Both approaches work; native training is generally reported to do better on complex cross-modal tasks.

Examples

GPT-4o describing a photo.
Gemini answering questions about a video.
Claude reading a screenshot of a UI.

Frequently asked

How is Multimodal related to Vision-Language Model?

A vision-language model is a multimodal model limited to two modalities: images and text. It can describe images, answer questions about them, and generate text grounded in visual input. "Multimodal" is the broader term and also covers models that accept audio, video, or other inputs.

Is Multimodal considered beginner?

Multimodal is generally considered beginner-level material in the AI and LLM space.

Vision-Language Model · Multimodal

A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Diffusion Model · Multimodal

Diffusion models generate images (and, more recently, video and audio) by learning to reverse a step-by-step noising process. Starting from pure noise, they denoise step by step until a coherent sample emerges.

Sources

GPT-4 paper (arXiv)