Inference · beginner

Batch API (batch inference)

Batch APIs (OpenAI, Anthropic) accept up to 50K LLM requests in a single submission, run them asynchronously over hours, and return results at ~50% of the synchronous price. The cheap option for bulk processing.

Published May 31, 2026

Explanation

For workloads that don't need real-time responses — generating embeddings for a million documents, evaluating a benchmark, running synthetic data generation, regrading a corpus — providers offer a batch tier at ~50% off.

The catch: latency goes from sub-second to "within 24 hours." For embeddings or eval that's fine; for user-facing chat it's a non-starter.

OpenAI launched the Batch API in April 2024; Anthropic followed with Message Batches; Google Vertex has long offered batch prediction. Spec is similar: submit a JSONL of requests, poll for completion, download the JSONL of responses.

Examples

Generating embeddings for 10M support tickets via OpenAI Batch: $0.05 / 1M tokens instead of $0.10, completed overnight.
Running an eval suite of 50K traces through GPT-4o Batch for a fraction of synchronous cost.

When to use batch api

Any time the work is bulk, async, and not user-facing — embedding pipelines, evals, synthetic data, batch labeling.

Frequently asked

What is Batch API?

What is an example of batch api?

Generating embeddings for 10M support tickets via OpenAI Batch: $0.05 / 1M tokens instead of $0.10, completed overnight.

How is Batch API related to Inference?

Batch API and Inference are both inference concepts. Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

When should I use batch api?

Any time the work is bulk, async, and not user-facing — embedding pipelines, evals, synthetic data, batch labeling.

Is Batch API considered beginner?

Batch API is generally considered beginner-level material in the AI and LLM space.

InferenceInference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

EmbeddingArchitecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Offline EvaluationEvaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

Synthetic DataTraining

Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.

Batch API (batch inference)

Explanation

Examples

When to use batch api

Frequently asked

What is Batch API?

What is an example of batch api?

How is Batch API related to Inference?

When should I use batch api?

Is Batch API considered beginner?

Side-by-side comparisons

Sources

Explanation

Examples

When to use batch api

Frequently asked

What is Batch API?

What is an example of batch api?

How is Batch API related to Inference?

When should I use batch api?

Is Batch API considered beginner?

Related terms

Side-by-side comparisons

Sources