Evaluation · intermediate

ARC-AGI (ARC)

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. The commonly cited human baseline is about 85%; large language models stayed near or below 5% for years.

Published May 30, 2026

Explanation

Each task shows a few input/output grid examples and asks the model to apply the same transformation to a new input. The transformations are easy for humans, but each task is built around a novel rule that has to be composed from the examples, so memorized training data offers no shortcut.
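To make the task format concrete, here is a minimal sketch in Python. It assumes the layout of the public ARC JSON files (a "train" list and a "test" list of input/output grid pairs, with cells as color indices 0-9). The toy grids, the mirror rule, and the names `toy_task`, `mirror_left_right`, and `solves` are invented for illustration and are not taken from the real dataset.

```python
from typing import Callable

Grid = list[list[int]]  # each cell is a color index from 0 to 9

# Hypothetical toy task: every output is the input mirrored left to right.
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
        {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]},
    ],
    "test": [
        {"input": [[6, 0, 0, 7]], "output": [[7, 0, 0, 6]]},
    ],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def solves(task: dict, solver: Callable[[Grid], Grid]) -> bool:
    """A task counts as solved only if every test output matches exactly."""
    return all(solver(pair["input"]) == pair["output"] for pair in task["test"])


if __name__ == "__main__":
    print(solves(toy_task, mirror_left_right))
```

Scoring here is all-or-nothing per grid: a prediction that gets most cells right but not all of them earns no credit in this sketch.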

ARC-AGI was among the hardest benchmarks for LLMs for years. In December 2024, OpenAI o3 scored 75.7% in a low-compute configuration and 87.5% in a high-compute one, using massive test-time compute. The benchmark's creators followed up in March 2025 with ARC-AGI-2, which is harder still.

ARC-AGI remains one of the leading signals for "is this model actually reasoning, or just retrieving?"

Examples

A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.
o3 spending hundreds of dollars of compute on a single hard ARC task and getting it right.

Frequently asked

What is ARC-AGI?

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. The commonly cited human baseline is about 85%; large language models stayed near or below 5% for years.

What is an example of ARC-AGI?

A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.

How is ARC-AGI related to Benchmark?

ARC-AGI is one specific benchmark, so both are evaluation concepts. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Is ARC-AGI considered intermediate?

ARC-AGI is generally considered intermediate-level material in the AI and LLM space.

Related terms

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Reasoning Model · Architecture

A reasoning model spends extra compute thinking step-by-step before answering. OpenAI o1/o3, DeepSeek R1, and Claude models running Anthropic's extended thinking are examples.

Test-Time Compute · Prompting

Test-time compute is the extra reasoning, sampling, or search a model can do at inference time to improve quality: more thinking tokens, more candidate answers, or verifier-guided search.
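As a rough illustration of verifier-guided search on a grid puzzle, the sketch below tries candidate transformations in a fixed order and keeps the first one that reproduces every training pair. The candidate set, the `budget` parameter, and all function names are assumptions made up for this example; real systems sample far richer programs or reasoning traces, and this is not how any particular model works.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Transform = Callable[[Grid], Grid]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def mirror_left_right(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_up_down(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical candidate pool, tried in insertion order.
CANDIDATES: dict[str, Transform] = {
    "identity": identity,
    "mirror_left_right": mirror_left_right,
    "flip_up_down": flip_up_down,
    "transpose": transpose,
}


def fits_all(transform: Transform, pairs: list[tuple[Grid, Grid]]) -> bool:
    """Verifier: the candidate must reproduce every training output exactly."""
    return all(transform(inp) == out for inp, out in pairs)


def search(pairs: list[tuple[Grid, Grid]], budget: int) -> Optional[str]:
    """Check at most `budget` candidates; return the first that verifies."""
    for name, transform in list(CANDIDATES.items())[:budget]:
        if fits_all(transform, pairs):
            return name
    return None


# Toy training pairs whose hidden rule is a transpose.
train_pairs = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[5, 0, 0]], [[5], [0], [0]]),
]

if __name__ == "__main__":
    print(search(train_pairs, budget=2))
    print(search(train_pairs, budget=4))
```

With a budget of 2 the search stops before reaching the rule that fits; with a budget of 4 it finds it. That is the basic trade being described: more inference-time work buys more chances to land on a verified answer.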

MMLU · Evaluation

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects from elementary to professional. It is one of the most widely cited LLM benchmarks.
