ModelTerms

Multimodal · beginner

Multimodal (multi-modal)

A multimodal model processes more than one type of input — typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Explanation

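In a multimodal LLM, non-text inputs are encoded into the same embedding space as text tokens (for images, often by a vision encoder), so that each image, audio clip, or video frame becomes a sequence of vectors of the same width as the text token embeddings. These vectors are then fed to the same transformer as the text. As a result, one model can answer questions about pictures, transcribe audio, or generate text from video frames.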

Natively multimodal models are trained from scratch on mixed data, while adapted multimodal models attach a vision encoder to an existing LLM through a projection layer that maps the encoder's output into the LLM's embedding space. Both approaches work; native models tend to be more capable on complex cross-modal tasks.
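
As a rough illustration of the adapted approach, the sketch below projects stand-in image features into the width of the text embeddings and joins the two into one input sequence. It is a toy under stated assumptions, not how any particular model is implemented: the sizes, the random weights, and the helper names (make_matrix, project, build_input_sequence) are all hypothetical, and real systems use learned weights and a real vision encoder.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values (stand-in for real data or weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Multiply each feature vector (length d_in) by weights (d_in x d_out)."""
    columns = list(zip(*weights))  # d_out columns, each of length d_in
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in columns]
        for vec in features
    ]


def build_input_sequence(image_features, text_embeddings, weights):
    """Project image features to the text embedding width, then prepend them to the text."""
    image_tokens = project(image_features, weights)
    return image_tokens + text_embeddings


rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
VISION_DIM, MODEL_DIM = 6, 4
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 5

image_features = make_matrix(NUM_PATCHES, VISION_DIM, rng)      # stand-in for vision encoder output
text_embeddings = make_matrix(NUM_TEXT_TOKENS, MODEL_DIM, rng)  # stand-in for embedded text tokens
projection = make_matrix(VISION_DIM, MODEL_DIM, rng)            # stand-in for the projection layer

sequence = build_input_sequence(image_features, text_embeddings, projection)
print(len(sequence), len(sequence[0]))  # 8 vectors, each of width 4
```

After projection, the image vectors and the text vectors have the same width, so the transformer can treat them as a single sequence. Placing the image vectors before the text is just one possible ordering, chosen here for simplicity.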

Examples

  • GPT-4o describing a photo.
  • Gemini answering questions about a video.
  • Claude reading a screenshot of a UI.

Frequently asked

How is Multimodal related to Vision-Language Model?

A vision-language model is one kind of multimodal model: it processes images and text specifically. It can describe images, answer questions about them, and generate text grounded in visual input. Multimodal is the broader term and also covers models that handle audio, video, or 3D.

Is Multimodal considered a beginner-level topic?

Multimodal is generally considered beginner-level material in the AI and LLM space.
