Evaluation · intermediate
ARC-AGI (ARC)
ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score around 85%; large language models stayed below about 5% for years.
Explanation
Each task shows a few input/output grid pairs and asks the model to apply the same transformation to a new input. The transformations are easy for humans but require compositional reasoning: every task is novel, so memorized training data offers no shortcut.
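To make the task format concrete, here is a minimal Python sketch. It assumes the layout of the public ARC JSON files: "train" and "test" lists of input/output pairs, where each grid is a list of rows of integers 0-9. The toy task, the mirror_left_right rule, and the helper names are hypothetical illustrations, not part of any official tooling.

```python
# Toy task in the layout used by the public ARC JSON files (assumed here):
# "train" and "test" are lists of {"input": grid, "output": grid} pairs,
# and a grid is a list of rows of integers 0-9 (each integer is a color).
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 0, 6]], "output": [[0, 0, 5], [6, 0, 0]]},
    ],
}


def mirror_left_right(grid):
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(transform, pairs):
    """True if the rule reproduces every demonstration output exactly."""
    return all(transform(p["input"]) == p["output"] for p in pairs)


def score_task(transform, task):
    """Fraction of test inputs solved; a grid counts only on an exact match."""
    tests = task["test"]
    solved = sum(transform(t["input"]) == t["output"] for t in tests)
    return solved / len(tests)


if __name__ == "__main__":
    print(fits_all_examples(mirror_left_right, toy_task["train"]))
    print(score_task(mirror_left_right, toy_task))
```

Scoring is all-or-nothing per grid: a prediction that gets every cell but one right still counts as a miss, which is part of why partial pattern matching earns little credit.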
For years, ARC-AGI was among the hardest benchmarks for LLMs. In late 2024, OpenAI's o3 broke into the 75-87% range, with the score depending on how much test-time compute it was given. The benchmark's creators subsequently released ARC-AGI-2, which is even harder.
It remains a leading signal for the question "is this model actually reasoning, or just retrieving?"
Examples
- A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.
- o3 spending hundreds of dollars of compute on a single hard ARC task and getting it right.
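The first example, learning a transformation from three demonstrations, can be sketched as a search over candidate rules. This is a deliberately tiny illustration: the four candidate transformations, the induce helper, and the demonstration grids are all hypothetical. Real ARC tasks need far richer rule spaces, and this is not how o3 or any particular solver works.

```python
def identity(grid):
    return [row[:] for row in grid]


def flip_horizontal(grid):
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


# Hypothetical, hand-picked rule space; real tasks are not limited to these.
CANDIDATES = [identity, flip_horizontal, flip_vertical, transpose]


def induce(train_pairs):
    """Return the first candidate consistent with every (input, output) pair."""
    for rule in CANDIDATES:
        if all(rule(inp) == out for inp, out in train_pairs):
            return rule
    return None  # no candidate explains all demonstrations


# Three demonstrations, written as (input, output) tuples for brevity.
train_pairs = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [0, 0]], [[0, 0], [5, 0]]),
    ([[7, 7], [0, 8]], [[7, 0], [7, 8]]),
]

rule = induce(train_pairs)
prediction = rule([[0, 9], [4, 0]])  # apply the induced rule to a new grid

if __name__ == "__main__":
    print(rule.__name__, prediction)
```

The search succeeds only because the right rule happens to be in the candidate list. The difficulty of ARC lies in tasks whose rule has to be composed on the spot rather than looked up.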
Frequently asked
What is ARC-AGI?
ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score around 85%; large language models stayed below about 5% for years.
What is an example of ARC-AGI?
A grid puzzle where you must learn a transformation from 3 examples and apply it to a new grid.
How is ARC-AGI related to Benchmark?
ARC-AGI is itself a benchmark, and both are evaluation concepts. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are other common examples.
Is ARC-AGI considered intermediate?
ARC-AGI is generally considered intermediate-level material in the AI and LLM space.