ModelTerms

Evaluation · intermediate

ARC-AGI (ARC)

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score about 85%; large language models stayed below 5% for years.

Explanation

Each task shows a few input/output grid examples and asks the model to apply the same transformation to a new input. The transformations are simple for humans but require genuine compositional reasoning, because each task uses a rule that cannot be answered by recalling memorized training data.
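The task structure described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real solver works: the "train"/"test" and "input"/"output" layout follows the public ARC task files, but the task itself, the four candidate transformations, and every function name here are hypothetical. Real ARC tasks need a far larger and more compositional search space than this fixed list.

```python
from typing import Callable, Optional

Grid = list[list[int]]  # each cell holds a small integer colour index

# Hypothetical toy task: every output is the input mirrored left to right.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
        {"input": [[5, 0], [0, 0]], "output": [[0, 5], [0, 0]]},
    ],
    "test": [{"input": [[0, 7, 7], [8, 0, 0]]}],
}

def identity(g: Grid) -> Grid:
    return [list(row) for row in g]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return [list(row) for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

# Assumed, deliberately tiny hypothesis space.
CANDIDATES: dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def infer_transformation(train: list[dict]) -> Optional[str]:
    """Return the first candidate that reproduces every training output exactly."""
    for name, fn in CANDIDATES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in train):
            return name
    return None

rule = infer_transformation(task["train"])
prediction = CANDIDATES[rule](task["test"][0]["input"]) if rule else None
print(rule, prediction)
```

The point of the sketch is the shape of the problem: the rule must be induced from a handful of examples and then applied to an unseen grid, so a solver limited to a fixed list like `CANDIDATES` fails as soon as a task needs a rule outside that list.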

ARC-AGI was among the hardest benchmarks for LLMs for years. In late 2024, OpenAI's o3 reached the 75-87% range using massive test-time compute. The benchmark's creators then released ARC-AGI-2, a harder successor.

It is currently a leading signal for the question "is this model actually reasoning, or just retrieving?"

Examples

  • A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.
  • o3 spending hundreds of dollars of compute on a single hard ARC task and getting it right.
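"Getting it right" on an ARC task means reproducing the expected output grid exactly; a grid that is one cell off earns nothing. The sketch below shows how a percentage score could be computed under that rule. The function names and sample results are hypothetical, and the limit of two attempts per task is an assumption made for illustration, since the allowed number of attempts has varied between versions of the benchmark.

```python
Grid = list[list[int]]

def solved(attempts: list[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """A task counts only if one of the first max_attempts grids matches exactly."""
    # max_attempts=2 is an assumed limit, not a confirmed rule of the benchmark.
    return any(attempt == expected for attempt in attempts[:max_attempts])

def benchmark_score(results: list[tuple[list[Grid], Grid]]) -> float:
    """Fraction of tasks solved; each entry is (attempted grids, expected grid)."""
    return sum(solved(attempts, expected) for attempts, expected in results) / len(results)

# Hypothetical results for four tasks.
results = [
    ([[[1, 2], [3, 4]]], [[1, 2], [3, 4]]),   # exact match on the first attempt
    ([[[0, 0]], [[0, 1]]], [[0, 1]]),          # second attempt matches
    ([[[5, 5], [5, 0]]], [[5, 5], [5, 5]]),   # one cell wrong, so unsolved
    ([[[9]], [[8]], [[7]]], [[7]]),            # correct only on a third attempt
]

score = benchmark_score(results)
print(f"{score:.0%}")
```

All-or-nothing grading is part of why partial pattern matching does not translate into a high score.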

Frequently asked

What is ARC-AGI?

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score about 85%; large language models stayed below 5% for years.

What is an example of an ARC-AGI task?

A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.

How is ARC-AGI related to benchmarks?

ARC-AGI is itself a benchmark. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are other common examples.

Is ARC-AGI considered intermediate?

ARC-AGI is generally considered intermediate-level material in the AI and LLM space.
