Infrastructure · intermediate

LLM Gateway (model gateway, AI proxy)

An LLM gateway is a proxy layer that sits between application code and one or more LLM providers — handling auth, rate-limit retries, cost tracking, observability, prompt caching, model routing, and PII redaction.

Published May 31, 2026

Explanation

Production LLM apps quickly accumulate cross-cutting concerns: which provider has spare quota, how to retry on 429s, how to track per-tenant cost, how to redact PII before sending, how to log every call, how to A/B test models. An LLM gateway centralizes them.

Popular implementations: LiteLLM (open-source proxy + SDK), Portkey, Helicone, Cloudflare AI Gateway, Kong AI Gateway, AWS Bedrock's built-in gateway features.

Application code points at the gateway with one base URL; the gateway handles fallbacks, routing, observability, and policies behind it. Trade-off: another hop in the latency budget (~10-50ms added).

Examples

LiteLLM proxy fronting OpenAI, Anthropic, and Bedrock with unified billing dashboards and automatic retry-on-fallback.
A multi-tenant SaaS routing all LLM calls through Cloudflare AI Gateway for caching + analytics.

When to use llm gateway

When you use multiple providers, need per-tenant cost attribution, or want centralized observability/PII policies.

Frequently asked

What is LLM Gateway?

What is an example of llm gateway?

LiteLLM proxy fronting OpenAI, Anthropic, and Bedrock with unified billing dashboards and automatic retry-on-fallback.

How is LLM Gateway related to LLM Observability?

LLM Gateway and LLM Observability are both infrastructure concepts. LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

When should I use llm gateway?

When you use multiple providers, need per-tenant cost attribution, or want centralized observability/PII policies.

Is LLM Gateway considered intermediate?

LLM Gateway is generally considered intermediate-level material in the AI and LLM space.

LLM ObservabilityInfrastructure

LLM observability is the practice of capturing, analyzing, and acting on every LLM call in a production system — inputs, outputs, latencies, costs, errors, and quality scores — so you can debug regressions and improve quality over time.

Model RouterInfrastructure

A model router picks the cheapest model that's likely to handle a given request well — based on a small classifier, embedding similarity, or rule-based filters — so you don't pay frontier prices for trivial queries.

Prompt CachingInference

Prompt caching stores the KV-cache state of a long prefix (system prompt, large document, tool definitions) so subsequent calls that reuse it skip the prefill compute — cutting TTFT and cost by 50-90%.

InferenceInference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Side-by-side comparisons

Sources

LiteLLM docs