Multimodal · intermediate

Vision-Language Model (VLM)

A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.

Published May 29, 2026

Explanation

Most modern VLMs use a vision encoder (often a ViT) to turn each image into a sequence of patch tokens, project those tokens into the language model's embedding space, and interleave them with text tokens before feeding the combined sequence to the transformer.
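
The pipeline above can be illustrated with a minimal shape-level sketch. It is not how any particular model is implemented: the trained vision encoder is replaced here by a single random linear map, the text embeddings are zero placeholders, and the names `patchify`, `project`, and `build_sequence` are hypothetical helpers chosen for this example.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def project(vectors, weight):
    """Linear projection: weight holds d_model rows, each of length d_in."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weight]
            for vec in vectors]

def build_sequence(text_before, image_tokens, text_after):
    """Place the image tokens between two runs of text-token embeddings."""
    return text_before + image_tokens + text_after

random.seed(0)
PATCH, D_MODEL = 2, 4  # toy sizes, assumed for illustration

# A 4x4 toy image whose pixel values are 0.0 .. 15.0 in row-major order.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, PATCH)  # 4 patches of 4 pixels each

# Stand-in for the vision encoder plus projector: one random linear map.
weight = [[random.gauss(0, 0.1) for _ in range(PATCH * PATCH)]
          for _ in range(D_MODEL)]
image_tokens = project(patches, weight)

# Placeholder embeddings for 3 text tokens before the image and 2 after it.
text_before = [[0.0] * D_MODEL for _ in range(3)]
text_after = [[0.0] * D_MODEL for _ in range(2)]

sequence = build_sequence(text_before, image_tokens, text_after)
print(len(patches), len(sequence), len(sequence[0]))  # prints: 4 9 4
```

The point of the sketch is the bookkeeping: once image patches are mapped to vectors of the same width as the text embeddings, the transformer sees one sequence and does not need a separate input path for images.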

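CLIP popularized contrastive image-text pretraining: it trains an image encoder and a text encoder jointly so that matching image-caption pairs land close together in a shared embedding space. Modern chat VLMs (GPT-4o, Claude, Gemini, Llama 3.2 Vision) go further by connecting visual input to a language model and training on multimodal data, so the model can generate text conditioned on images rather than only score image-text similarity.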
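
The alignment objective can be sketched as a symmetric contrastive loss over a batch of image-text pairs. This is a simplified illustration rather than the CLIP training code: the embeddings are hand-written toy vectors, the temperature is fixed at an assumed value (CLIP learns it during training), and `clip_style_loss` is a hypothetical name.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of images matches pair i of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Each image should pick out its own caption among all captions...
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # ...and each caption should pick out its own image among all images.
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]   # caption i points the same way as image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

aligned = clip_style_loss(image_embs, matched_texts)
shuffled = clip_style_loss(image_embs, shuffled_texts)
print(aligned < shuffled)  # prints: True
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is the pressure that pulls matching images and texts together in the shared space.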

Examples

Asking a model "what is wrong with this UI screenshot?"
Reading a hand-written note and turning it into structured text.

Frequently asked

How is a vision-language model related to multimodal models?

A vision-language model is one kind of multimodal model: its two modalities are images and text. Multimodal is the broader category, covering any model that processes more than one type of input, typically text plus images and sometimes audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Are vision-language models considered an intermediate topic?

Yes. Vision-language models are generally considered intermediate-level material in the AI and LLM space.

Multimodal · Multimodal

A multimodal model processes more than one type of input — typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

Sources

CLIP paper (arXiv)