Skip to main content
ModelTerms

Inference · beginner

Batch API (batch inference)

Batch APIs (OpenAI, Anthropic) accept up to 50K LLM requests in a single submission, run them asynchronously over hours, and return results at ~50% of the synchronous price. The cheap option for bulk processing.

Explanation

For workloads that don't need real-time responses — generating embeddings for a million documents, evaluating a benchmark, running synthetic data generation, regrading a corpus — providers offer a batch tier at ~50% off.

The catch: latency goes from sub-second to "within 24 hours." For embeddings or eval that's fine; for user-facing chat it's a non-starter.

OpenAI launched the Batch API in April 2024; Anthropic followed with Message Batches; Google Vertex has long offered batch prediction. Spec is similar: submit a JSONL of requests, poll for completion, download the JSONL of responses.

Examples

  • Generating embeddings for 10M support tickets via OpenAI Batch: $0.05 / 1M tokens instead of $0.10, completed overnight.
  • Running an eval suite of 50K traces through GPT-4o Batch for a fraction of synchronous cost.

When to use batch api

Any time the work is bulk, async, and not user-facing — embedding pipelines, evals, synthetic data, batch labeling.

Frequently asked

What is Batch API?

Batch APIs (OpenAI, Anthropic) accept up to 50K LLM requests in a single submission, run them asynchronously over hours, and return results at ~50% of the synchronous price. The cheap option for bulk processing.

What is an example of batch api?

Generating embeddings for 10M support tickets via OpenAI Batch: $0.05 / 1M tokens instead of $0.10, completed overnight.

How is Batch API related to Inference?

Batch API and Inference are both inference concepts. Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

When should I use batch api?

Any time the work is bulk, async, and not user-facing — embedding pipelines, evals, synthetic data, batch labeling.

Is Batch API considered beginner?

Batch API is generally considered beginner-level material in the AI and LLM space.

InferenceInference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

EmbeddingArchitecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.

Offline EvaluationEvaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

Synthetic DataTraining

Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.

Side-by-side comparisons

Sources