ARC·vla

State of the art,
measured in the
real world.

ARC is the independent benchmark VLA labs run before they ship to customers. We measure the cognitive brain — System 2, after RoboBench (PKU/BAAI) — and the motor policy — System 1, after VLABench (ICCV ′25) — on the same physical rollout, then fuzz every rollout with the ten perturbation operators of VLATest (FSE ′25). Every score is signed, replayable, and tied to a hardware-of-record you can audit. No simulators. No cherry-picked episodes. One number per model, defensible on a customer call.

8 custom tasks · 9 VLA models live · 2.4k hours bench-time · v0.4 open protocol
§ 01

The leaderboard.
Brain · policy · robust · industrial.

Last sync 02:14 UTC
Cycle 2026·Q2
CI 95% reported
Rank | Model | Lab · Params · Date | Architecture | Brain | Policy | Robust | Industrial | ARC Composite
01 | Helix-2 | Figure · 8.4 B · 2026·02 | Hierarchical S2/S1 VLA | 58.4 | 18.2 | 47.1 | 71.0 | 41.6
02 | π0.5 | Physical Intelligence · 3.3 B · 2025·11 | Flow-matching VLA | 54.1 | 16.4 | 44.0 | 64.5 | 38.7
03 | Gemini-Robotics 1.5 | Google DeepMind · n/a · 2025·10 | Multimodal VLA | 62.9 | 11.2 | 41.5 | 56.0 | 36.4
04 | GR00T N1.5 | NVIDIA · 12 B · 2025·09 | Dual-system VLA | 51.0 | 13.6 | 39.8 | 60.2 | 35.1
05 | RoboBrain-2.0 | BAAI · 7 B · 2025·08 | Embodied-finetuned MLLM + π adapter | 56.8 | 9.6 | 36.2 | 48.0 | 31.9
06 | π0-Fast | Physical Intelligence · 3.0 B · 2025·05 | Discretized flow VLA | 47.2 | 9.2 | 35.7 | 52.4 | 29.8
07 | OpenVLA-7B | Stanford / TRI · 7 B · 2024·09 | LLaMA-2 backbone | 41.6 | 6.4 | 32.1 | 44.0 | 25.3
08 | RDT-1B | Tsinghua · 1.2 B · 2024·10 | Diffusion transformer | 36.2 | 6.0 | 28.4 | 39.1 | 22.4
09 | Octo-base | Berkeley AI · 1.3 B · 2024·06 | Diffusion transformer | 28.1 | 2.0 | 24.3 | 32.0 | 16.9
[ Composite weights — 0.25 brain · 0.35 policy · 0.25 robust · 0.15 industrial ]
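The stated weighting reduces to a simple weighted sum over the four axis scores. A minimal sketch, assuming raw 0–100 axis scores go in directly; any normalization ARC applies beyond these published weights is not specified here.

```python
# Published ARC composite weights (see the bracket note above).
WEIGHTS = {"brain": 0.25, "policy": 0.35, "robust": 0.25, "industrial": 0.15}

def arc_composite(scores: dict) -> float:
    """Weighted sum of the four axis scores, rounded to one decimal."""
    assert set(scores) == set(WEIGHTS), "exactly one score per axis"
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 1)

# A model scoring 50 on every axis composites to 50.
print(arc_composite({"brain": 50.0, "policy": 50.0, "robust": 50.0, "industrial": 50.0}))  # → 50.0
```

Because policy carries the largest weight (0.35), two models with equal means can rank differently when one concentrates its points on execution.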
§ 02

Six axioms for honest
robotics evaluation.

Spec v0.4.2
Open protocol
CC-BY-SA
AX–01

Two systems, one score.

We grade the cognitive brain (System 2) and the motor policy (System 1) on the same rollout — RoboBench-style symbolic fidelity above VLABench-style execution success.

/ Dual-system evaluation
AX–02

Adversarial by construction.

Every rollout perturbs the environment along the ten VLATest operators — confounding objects, lighting, camera pose, paraphrased instructions, OOD substitutions. Memorization fails by design.

/ Robustness primary
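A rollout-side perturbation pass might look like the sketch below. The operator categories (confounding objects, lighting, camera pose, paraphrased instructions) come from the axiom above; the scene-config schema and field names are hypothetical, not ARC's actual format.

```python
import random

def perturb_scene(scene: dict, rng: random.Random) -> dict:
    """Apply one perturbation per VLATest-style category (illustrative sketch)."""
    out = dict(scene)
    out["distractors"] = scene["distractors"] + rng.randint(1, 3)              # confounding objects
    out["light_intensity"] = scene["light_intensity"] * rng.uniform(0.5, 1.5)  # lighting drift
    out["camera_yaw_deg"] = scene["camera_yaw_deg"] + rng.uniform(-10, 10)     # camera pose
    out["instruction"] = rng.choice(scene["paraphrases"])                      # paraphrased instruction
    return out

base = {"distractors": 2, "light_intensity": 1.0, "camera_yaw_deg": 0.0,
        "paraphrases": ["pick up the mug", "grab the mug", "take the mug"]}
print(perturb_scene(base, random.Random(0)))
```

Seeding the generator per rollout keeps the fuzzing adversarial yet replayable, which is what makes a perturbed score auditable after the fact.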
AX–03

MLLM-as-world-simulator scoring.

Plans are rolled out through an MLLM-simulated world with human-annotated state DAGs — Pearson-validated against 20 expert annotators. No BLEU, no LLM-pairwise theatre.

/ Statistical fidelity
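Scoring a predicted plan against a human-annotated state DAG can be sketched as a precedence check: every state the plan reaches must have its prerequisite states reached first. The DAG encoding (a map from each state to its prerequisites) and the step format are assumptions for illustration.

```python
def plan_respects_dag(plan: list, dag: dict) -> bool:
    """Check that a plan's step order respects the DAG's precedence edges."""
    reached = set()
    for state in plan:
        if not set(dag.get(state, [])) <= reached:
            return False  # a prerequisite state was never reached
        reached.add(state)
    return True

# Hypothetical pouring task: must grasp, then align, before pouring.
dag = {"aligned": ["grasped"], "poured": ["grasped", "aligned"]}
print(plan_respects_dag(["grasped", "aligned", "poured"], dag))  # → True
print(plan_respects_dag(["aligned", "poured", "grasped"], dag))  # → False
```

Grading against state transitions rather than surface text is what removes the need for BLEU or pairwise-LLM judging.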
AX–04

Open protocol, real hardware.

Two calibrated physical benches. The hardware reference, scoring rubric, rollout videos, and failure-DAG annotations are all public. Any lab can replicate the suite. CC-BY-SA.

/ Reproducibility
AX–05

Continuous re-testing.

Models are re-evaluated on every checkpoint. The leaderboard reflects current capability — not the cherry-picked snapshot you submitted six months ago.

/ Live evaluation
AX–06

Failure is the product.

Each run emits a typed failure trace — execution / identification / common-sense / mode-specific (per RoboBench §5.3). Auditable, regression-comparable, deployment-grade.

/ Failure taxonomy
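A typed failure trace per the taxonomy above might be encoded like this sketch. Only the four failure modes come from the text (per RoboBench §5.3); the record fields are illustrative, not ARC's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    """The four failure modes named in AX-06."""
    EXECUTION = "execution"
    IDENTIFICATION = "identification"
    COMMON_SENSE = "common-sense"
    MODE_SPECIFIC = "mode-specific"

@dataclass(frozen=True)
class FailureTrace:
    """Illustrative per-run failure record; frozen so traces are audit-stable."""
    episode_id: str
    step: int
    mode: FailureMode
    note: str

trace = FailureTrace("ep-0042", step=7, mode=FailureMode.IDENTIFICATION,
                     note="grasped distractor mug instead of target")
print(trace.mode.value)  # → identification
```

Typing the mode as an enum is what makes traces regression-comparable: two checkpoints can be diffed by failure-mode histogram rather than by free-text notes.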
§ 03

Eight custom tasks.
Mapped to academic dimensions.

Physical bench A · B
16-camera capture · 120 Hz proprio
10 VLATest operators per rollout
T-01

Cluttered Pantry

Perception Reasoning · RoboBench / Mesh & Texture · VLABench

Retrieve a specified item from a densely populated shelf containing visually similar distractors. Adversarial confounders, paraphrased instructions, drifting lighting between rollouts.

Manipulation · Perception · Confounders
T-02

Translate-and-Fold

Instruction Comprehension · RoboBench / Semantic Adaptation · VLABench T4

Receive a paraphrased natural-language instruction (5 mutations per rollout, mixed languages) and fold a deformable object accordingly. Probes language grounding and instruction robustness.

Language · Deformables · Paraphrase
T-03

Cafeteria Tray

Generalized Planning · RoboBench / Long-Horizon · VLABench T6

Plan and execute a 9-step assembly: open dispensers, portion, plate, garnish, serve. Composite long-horizon — maps directly to VLABench Track-6 (where SR for π0-Fast peaks at 1.6%).

Long-horizon · Multi-skill · Tool-use
T-04

Disturbed Pour

Failure Analysis · RoboBench / Reactive Control · ARC

Pour granular media into a vessel while an adversarial human nudges the workspace. Online replanning, force-aware control, recovery from contact failure.

Recovery · Granular · Reactive
T-05

OOD Glassware

Cross-object · VLABench T2 / Perturbation · VLATest RQ3-5

Manipulate fragile transparent items absent from training distribution. Camera-pose, lighting, and object-mesh perturbations sweep the VLATest fuzzing operators.

OOD · Force-aware · Fuzzing
T-06

Partner-Handoff

Dynamic Affordance · RoboBench / Safety Envelope · ARC

Two arms — one human, one robotic — collaborate to pass and assemble pieces. Intent inference, timing, dynamic affordance, safety envelope under shared workspace.

HRI · Bimanual · Timing
T-07

Map-Free Navigation

Navigation Affordance · RoboBench / Spatial · VLABench

Navigate an unmapped office to fetch an object specified only by referring expression ("the blue mug Maria left near the window"). Spatial language + memory-driven exploration.

Spatial · Memory · Language
T-08

Tool-Improvise

Common Sense · VLABench T3 / Physical Law · VLABench

Intended tool is missing. Reach the goal using a non-canonical substitute (a ruler in place of a spatula). Tests creative reuse and physical-law reasoning.

Reasoning · Tool-use · Improvisation
§ 04

Per-task score matrix.

Mean of 32 rollouts
Heat: relative-to-best
Model · Lab | T-01 | T-02 | T-03 | T-04 | T-05 | T-06 | T-07 | T-08
Helix-2 · Figure | 33.0 | 32.1 | 34.9 | 45.8 | 55.5 | 51.5 | 53.6 | 41.7
π0.5 · Physical Intelligence | 45.9 | 45.6 | 44.6 | 48.7 | 49.8 | 38.2 | 35.2 | 22.3
Gemini-Robotics 1.5 · Google DeepMind | 49.4 | 40.5 | 31.9 | 30.9 | 31.0 | 22.5 | 26.2 | 21.7
GR00T N1.5 · NVIDIA | 32.6 | 22.7 | 17.2 | 22.9 | 31.4 | 30.9 | 39.9 | 37.0
RoboBrain-2.0 · BAAI | 22.7 | 21.2 | 23.7 | 34.7 | 44.8 | 41.6 | 44.4 | 33.2
π0-Fast · Physical Intelligence | 35.6 | 35.7 | 35.5 | 40.3 | 42.1 | 30.8 | 27.7 | 14.4
OpenVLA-7B · Stanford / TRI | 38.8 | 30.6 | 22.3 | 21.2 | 20.9 | 11.8 | 14.6 | 9.5
RDT-1B · Tsinghua | 21.3 | 11.0 | 8.0 | 9.7 | 17.6 | 16.7 | 25.8 | 23.3
Octo-base · Berkeley AI | 8.0 | 8.0 | 8.0 | 18.3 | 28.8 | 26.2 | 29.7 | 19.2
[ Best per column highlighted · normalized 0–100 ]
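The "relative-to-best" heat shading reduces to per-column normalization: each task column is scaled so its best score maps to 1.0. A minimal sketch; the rounding and shading details are assumptions.

```python
def heat_relative_to_best(column: list) -> list:
    """Scale a score column so the best entry maps to 1.0 (heat-shading sketch)."""
    best = max(column)
    return [round(v / best, 3) for v in column]

# T-01 scores for the top three leaderboard models:
print(heat_relative_to_best([33.0, 45.9, 49.4]))  # → [0.668, 0.929, 1.0]
```

Normalizing per column rather than per row keeps a uniformly hard task (such as T-08) from washing out in the heat map.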
§ 05

Academic precedent.
Three benchmarks. We compose them.

Independent · peer-reviewed
Cited verbatim
Validated against 20+ annotators
P/01 · arXiv:2510.17801 · Oct 2025

RoboBench

An MLLM-as-Embodied-Brain Evaluation Benchmark

PKU · BAAI · Fudan · USTB · Beijing Humanoid

System-2 cognition. 5 dimensions, 14 capabilities, 25 tasks, 6,092 QA pairs. Plans rolled out through an MLLM-simulated world with human-annotated state DAGs.

Best perception (Gemini-2.5-Pro)
62.96
Best generalized planning
39.33
Failure-analysis (best model)
45.14
Human ceiling (planning)
69.83

Frontier MLLMs trail human cognition by 10–35 points across embodied dimensions.

P/02 · ICCV 2025 · Fudan · OpenMOSS

VLABench

Long-Horizon Language-Conditioned Manipulation

Zhang et al.

System-1 execution. 100 task categories (60 primitive + 40 composite), 2,000+ 3D objects, MuJoCo. Six tracks measuring skill, generalization, semantic adaptation, long-horizon composition.

π0-Base · Track 1 SR
13.2
π0-Fast · Track 6 SR
1.6
Cross-domain (Track 5) SR
0.0
Difficulty Level vs LIBERO
75.96 / 17.96

State-of-the-art VLAs score in the single digits on real generalization.

P/03 · FSE 2025 · U. Alberta · U. Tokyo

VLATest

Testing & Evaluating Vision-Language-Action Models

Wang, Zhou, Song, Huang, Shu, Ma

Robustness fuzzing. 18,604 generated scenes, 78,604 rollouts, 580+ GPU hours, ten testing operators across object configuration, lighting, camera pose, instruction phrasing.

Models tested (RT / Octo / OpenVLA)
7
Lighting/camera robustness
drops
Unseen-object generalization
weak
Paraphrase robustness
fragile

Pretrained VLAs lack the robustness necessary for practical deployment.

[ ARC composes RoboBench (brain) + VLABench (policy) + VLATest (robust) and adds an industrial axis — cycle time, MTBF, safety envelope, regulatory-grade audit trail — that no academic benchmark covers. ]
§ 06

Why now.
Three cohorts. One score they can all defend.

VLAs are scaling faster
than internal benches calibrate.
ARC is the third-party check.
C/01 · VLA labs

Independent calibration before customer ship.

Internal benches drift. Customers don't trust your numbers. ARC publishes a signed score, on hardware you don't own, that your sales team can put in a deck the next morning.

Submit-to-score SLA
≤ 72 h
C/02 · Robot OEMs · fleet operators

Apples-to-apples on the workcell you actually buy.

Pick a candidate model. We run it on your reference workcell SKU under our perturbation envelope and hand back a defensible procurement-grade comparison — not a vendor pitch deck.

Tasks · rollouts each
8 / 32
C/03 · Insurers · regulators

An auditable rollout trail for safety review.

Every episode is logged at 30 Hz, signed, and replayable from raw observation. Safety envelopes (force, reach, exclusion zones) are scored as first-class metrics — not an afterthought.

Signed log rate
30 Hz
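Signing an episode log so it is tamper-evident yet replayable can be sketched with an HMAC over the canonical serialization of the frame stream. The key handling and frame schema here are assumptions; a real deployment would manage keys in dedicated hardware.

```python
import hashlib
import hmac
import json

SECRET = b"bench-signing-key"  # hypothetical; stands in for a properly managed key

def sign_episode(frames: list) -> str:
    """HMAC-SHA256 over the canonical JSON serialization of a 30 Hz frame list."""
    payload = json.dumps(frames, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

frames = [{"t": i / 30.0, "force_n": 0.0} for i in range(3)]  # three frames at 30 Hz
sig = sign_episode(frames)
print(sig)
# Verification on replay: recompute from raw observations, compare in constant time.
assert hmac.compare_digest(sig, sign_episode(frames))
```

Because the signature is computed from the raw observations themselves, any edit to a logged frame changes the digest, which is what makes the trail usable in a safety review.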
[ ARC is operated as an independent lab. We do not sell models, we do not take payment from labs whose models we score, and every protocol revision is published before the next rollout cycle. ]

State-of-the-art VLAs score in the single digits on
generalized long-horizon manipulation. ICCV said it.
FSE said it. We just keep score.

VLABench Track-6 SR — π0-Fast: 1.6 / OpenVLA: 0.4 / Octo: 0.0