ModelTerms

Multimodal · intermediate

Vision-Language Model (VLM)

A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.

Explanation

Most modern VLMs use a vision encoder (often a Vision Transformer, ViT) to turn each image into a sequence of patch tokens, project those tokens into the language model's embedding space, and interleave them with text tokens before feeding the combined sequence to the transformer.
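The pipeline above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any particular model is implemented: the vision encoder is reduced to a single linear projection, and `patchify`, `project`, `build_sequence`, `D_MODEL`, and `PATCH` are hypothetical names and sizes chosen for the example.

```python
import random

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch vectors, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def project(vectors, weights):
    """Linear projection: each vector (length d_in) times a d_in x d_out matrix."""
    d_out = len(weights[0])
    return [[sum(x * w_row[j] for x, w_row in zip(vec, weights))
             for j in range(d_out)]
            for vec in vectors]

def build_sequence(text_before, image_tokens, text_after):
    """Interleave: text embeddings, then image tokens, then more text embeddings."""
    return text_before + image_tokens + text_after

rng = random.Random(0)
D_MODEL = 8   # hypothetical embedding width of the language model
PATCH = 4     # hypothetical patch size in pixels

# Toy 8x8 grayscale "image" and a random projection matrix (16 -> 8).
image = [[rng.random() for _ in range(8)] for _ in range(8)]
W = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH)]

image_tokens = project(patchify(image, PATCH), W)   # one token per patch
text_before = [[0.0] * D_MODEL for _ in range(3)]   # stand-ins for text embeddings
text_after = [[0.0] * D_MODEL for _ in range(2)]

sequence = build_sequence(text_before, image_tokens, text_after)
print(len(image_tokens), len(sequence), len(sequence[0]))
```

The point of the sketch is the shape bookkeeping: once image patches are projected to the same width as text embeddings, the transformer sees one mixed sequence and does not need a separate code path for images.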

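CLIP pioneered the alignment step by training a vision encoder and a text encoder jointly, with a contrastive objective, so that matching images and captions land close together in a shared embedding space. Modern chat VLMs (GPT-4o, Claude, Gemini, Llama 3.2 Vision) go further: they couple visual input to a language model trained on multimodal data, so the model can generate free-form text about an image rather than only score image-text matches.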
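To make the idea of aligning image and text embeddings concrete, here is a minimal sketch of a symmetric contrastive loss of the kind CLIP-style training uses. It is an illustration under stated assumptions, not CLIP's actual code: the function names are hypothetical, the embeddings are hand-picked 2-D vectors, and the temperature of 0.07 is just an illustrative value.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Hand-picked toy embeddings: pair k of images matches pair k of texts.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[0]]  # deliberately mismatched

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_shuffled = contrastive_loss(aligned_images, shuffled_texts)
print(loss_aligned < loss_shuffled)
```

Training pushes the loss down, which means pulling each image toward its own caption and away from the other captions in the batch.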

Examples

  • Asking a model "What is wrong with this UI screenshot?"
  • Reading a hand-written note and turning it into structured text.

Frequently asked

What is a vision-language model?

A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.

What is an example of a vision-language model in use?

Asking a model "what is wrong with this UI screenshot?"

How are vision-language models related to multimodal models?

A vision-language model is one kind of multimodal model. A multimodal model processes more than one type of input: typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.

Is the vision-language model topic considered intermediate?

Yes. Vision-language models are generally considered intermediate-level material in the AI and LLM space.
