Multimodal · intermediate
Vision-Language Model (VLM)
A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.
Explanation
Most modern VLMs use a vision encoder (often a ViT) to turn each image into a sequence of patch tokens, project those tokens into the language model's embedding space, and interleave them with text tokens before feeding the combined sequence to the language model's transformer.
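The pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular model is implemented: a single random linear map stands in for the vision encoder plus projector, and the sizes (`PATCH`, `D_MODEL`), the tiny vocabulary, and all helper names are hypothetical.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def linear(vec, weight):
    """Multiply a length-n vector by an n x m weight matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def random_matrix(rows, cols, rng):
    """Random weights; a real model would learn these during training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
PATCH, D_MODEL = 4, 8  # hypothetical toy sizes

# An 8x8 toy "image" yields four 4x4 patches.
image = [[rng.random() for _ in range(8)] for _ in range(8)]

# Assumption: one linear map stands in for the vision encoder and projector.
projector = random_matrix(PATCH * PATCH, D_MODEL, rng)
image_tokens = [linear(p, projector) for p in patchify(image, PATCH)]

# Stand-in for the language model's text embedding table.
vocab = {"describe": 0, "this": 1, ":": 2, "in": 3, "detail": 4}
embedding_table = random_matrix(len(vocab), D_MODEL, rng)

def embed(words):
    """Look up one embedding vector per word."""
    return [embedding_table[vocab[w]] for w in words]

# Interleave: text tokens, then image tokens, then more text tokens.
sequence = embed(["describe", "this", ":"]) + image_tokens + embed(["in", "detail"])
```

Every element of `sequence` has the same width, `D_MODEL`, which is what lets the transformer treat image patches and words as one token stream.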
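CLIP popularized contrastive image-text pretraining: it trains a vision encoder and a text encoder jointly so that matching image-caption pairs land close together in a shared embedding space. Modern chat VLMs (GPT-4o, Claude, Gemini, Llama 3.2 Vision) go further: rather than only aligning two encoders, they train a language model on multimodal data so it can generate text conditioned on images.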
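The alignment objective behind CLIP-style training can be sketched as a symmetric contrastive loss over a batch of image and text embeddings. This is a minimal stdlib sketch under stated assumptions: the function names are hypothetical, the embeddings are hand-written toy vectors, and the temperature is fixed at an illustrative value where a real system would typically learn it.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts.

    The fixed temperature is an assumption made for illustration.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should pick out its own column.
    i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch: two images and their two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # each caption matches its image
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # captions assigned to the wrong images

loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
```

The loss is small when each image is most similar to its own caption and large when the pairs are mismatched, which is the pressure that pulls the two encoders into a shared embedding space.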
Examples
- Asking a model "What is wrong with this UI screenshot?"
- Reading a hand-written note and turning it into structured text.
Frequently asked
What is a vision-language model?
A vision-language model processes both images and text. It can describe images, answer questions about them, and generate text grounded in visual input.
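What is an example of a vision-language model task?
A typical task is asking a model "What is wrong with this UI screenshot?" Another is reading a hand-written note and turning it into structured text.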
How is a vision-language model related to multimodal models?
A vision-language model is one kind of multimodal model: it handles images and text. More generally, a multimodal model processes more than one type of input, typically text plus images, sometimes adding audio, video, or 3D. GPT-4o, Claude, and Gemini are all multimodal.
Are vision-language models considered an intermediate topic?
Yes. Vision-language models are generally considered intermediate-level material in the AI and LLM space.