ARC·vla

State of the art,
measured in the
real world.

ARC is the independent benchmark VLA labs run before they ship to customers. We measure the cognitive brain — System 2, after RoboBench (PKU/BAAI) — and the motor policy — System 1, after VLABench (ICCV ′25) — on the same physical rollout, then fuzz every rollout with the ten perturbation operators of VLATest (FSE ′25). Every score is signed, replayable, and tied to a hardware-of-record you can audit. No simulators. No cherry-picked episodes. One number per model, defensible on a customer call.

8 custom tasks · 9 VLA models live · 2.4k hours bench-time · v0.4 open protocol
§ 01

The leaderboard.
Brain · policy · robust · industrial.

Last sync 02:14 UTC
Cycle 2026·Q2
CI 95% reported
Rank | Model | Lab · Params · Date | Architecture | Brain | Policy | Robust | Industrial | ARC Composite
01 | Helix-2 | Figure · 8.4 B · 2026·02 | Hierarchical S2/S1 VLA | 58.4 | 18.2 | 47.1 | 71.0 | 41.6
02 | π0.5 | Physical Intelligence · 3.3 B · 2025·11 | Flow-matching VLA | 54.1 | 16.4 | 44.0 | 64.5 | 38.7
03 | Gemini-Robotics 1.5 | Google DeepMind · n/a · 2025·10 | Multimodal VLA | 62.9 | 11.2 | 41.5 | 56.0 | 36.4
04 | GR00T N1.5 | NVIDIA · 12 B · 2025·09 | Dual-system VLA | 51.0 | 13.6 | 39.8 | 60.2 | 35.1
05 | RoboBrain-2.0 | BAAI · 7 B · 2025·08 | Embodied-finetuned MLLM + π adapter | 56.8 | 9.6 | 36.2 | 48.0 | 31.9
06 | π0-Fast | Physical Intelligence · 3.0 B · 2025·05 | Discretized flow VLA | 47.2 | 9.2 | 35.7 | 52.4 | 29.8
07 | OpenVLA-7B | Stanford / TRI · 7 B · 2024·09 | LLaMA-2 backbone | 41.6 | 6.4 | 32.1 | 44.0 | 25.3
08 | RDT-1B | Tsinghua · 1.2 B · 2024·10 | Diffusion transformer | 36.2 | 6.0 | 28.4 | 39.1 | 22.4
09 | Octo-base | Berkeley AI · 1.3 B · 2024·06 | Diffusion transformer | 28.1 | 2.0 | 24.3 | 32.0 | 16.9
[ Composite weights — 0.25 brain · 0.35 policy · 0.25 robust · 0.15 industrial ]
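The stated weighting reduces to a simple weighted sum over the four axis scores. A minimal sketch, assuming raw 0–100 axis scores go in directly; any normalization ARC applies beyond these published weights is not specified here.

```python
# Published ARC composite weights (see the bracket note above).
WEIGHTS = {"brain": 0.25, "policy": 0.35, "robust": 0.25, "industrial": 0.15}

def arc_composite(scores: dict) -> float:
    """Weighted sum of the four axis scores, rounded to one decimal."""
    assert set(scores) == set(WEIGHTS), "exactly one score per axis"
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 1)

# A model scoring 50 on every axis composites to 50.
print(arc_composite({"brain": 50.0, "policy": 50.0, "robust": 50.0, "industrial": 50.0}))  # → 50.0
```

Because policy carries the largest weight (0.35), two models with equal means can rank differently when one concentrates its points on execution.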
§ 02

Six axioms for honest
robotics evaluation.

Spec v0.4.2
Open protocol
CC-BY-SA
AX–01

Two systems, one score.

We grade the cognitive brain (System 2) and the motor policy (System 1) on the same rollout — RoboBench-style symbolic fidelity above VLABench-style execution success.

/ Dual-system evaluation
AX–02

Adversarial by construction.

Every rollout perturbs the environment along the ten VLATest operators — confounding objects, lighting, camera pose, paraphrased instructions, OOD substitutions. Memorization fails by design.

/ Robustness primary
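A rollout-side perturbation pass might look like the sketch below. The operator categories (confounding objects, lighting, camera pose, paraphrased instructions) come from the axiom above; the scene-config schema and field names are hypothetical, not ARC's actual format.

```python
import random

def perturb_scene(scene: dict, rng: random.Random) -> dict:
    """Apply one perturbation per VLATest-style category (illustrative sketch)."""
    out = dict(scene)
    out["distractors"] = scene["distractors"] + rng.randint(1, 3)              # confounding objects
    out["light_intensity"] = scene["light_intensity"] * rng.uniform(0.5, 1.5)  # lighting drift
    out["camera_yaw_deg"] = scene["camera_yaw_deg"] + rng.uniform(-10, 10)     # camera pose
    out["instruction"] = rng.choice(scene["paraphrases"])                      # paraphrased instruction
    return out

base = {"distractors": 2, "light_intensity": 1.0, "camera_yaw_deg": 0.0,
        "paraphrases": ["pick up the mug", "grab the mug", "take the mug"]}
print(perturb_scene(base, random.Random(0)))
```

Seeding the generator per rollout keeps the fuzzing adversarial yet replayable, which is what makes a perturbed score auditable after the fact.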
AX–03

MLLM-as-world-simulator scoring.

Plans are rolled out through an MLLM-simulated world with human-annotated state DAGs — Pearson-validated against 20 expert annotators. No BLEU, no LLM-pairwise theatre.

/ Statistical fidelity
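Scoring a predicted plan against a human-annotated state DAG can be sketched as a precedence check: every state the plan reaches must have its prerequisite states reached first. The DAG encoding (a map from each state to its prerequisites) and the step format are assumptions for illustration.

```python
def plan_respects_dag(plan: list, dag: dict) -> bool:
    """Check that a plan's step order respects the DAG's precedence edges."""
    reached = set()
    for state in plan:
        if not set(dag.get(state, [])) <= reached:
            return False  # a prerequisite state was never reached
        reached.add(state)
    return True

# Hypothetical pouring task: must grasp, then align, before pouring.
dag = {"aligned": ["grasped"], "poured": ["grasped", "aligned"]}
print(plan_respects_dag(["grasped", "aligned", "poured"], dag))  # → True
print(plan_respects_dag(["aligned", "poured", "grasped"], dag))  # → False
```

Grading against state transitions rather than surface text is what removes the need for BLEU or pairwise-LLM judging.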
AX–04

Open protocol, real hardware.

Two calibrated physical benches. The hardware reference, scoring rubric, rollout videos, and failure-DAG annotations are all public. Any lab can replicate the suite. CC-BY-SA.

/ Reproducibility
AX–05

Continuous re-testing.

Models are re-evaluated on every checkpoint. The leaderboard reflects current capability — not the cherry-picked snapshot you submitted six months ago.

/ Live evaluation
AX–06

Failure is the product.

Each run emits a typed failure trace — execution / identification / common-sense / mode-specific (per RoboBench §5.3). Auditable, regression-comparable, deployment-grade.

/ Failure taxonomy
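A typed failure trace per the taxonomy above might be encoded like this sketch. Only the four failure modes come from the text (per RoboBench §5.3); the record fields are illustrative, not ARC's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    """The four failure modes named in AX-06."""
    EXECUTION = "execution"
    IDENTIFICATION = "identification"
    COMMON_SENSE = "common-sense"
    MODE_SPECIFIC = "mode-specific"

@dataclass(frozen=True)
class FailureTrace:
    """Illustrative per-run failure record; frozen so traces are audit-stable."""
    episode_id: str
    step: int
    mode: FailureMode
    note: str

trace = FailureTrace("ep-0042", step=7, mode=FailureMode.IDENTIFICATION,
                     note="grasped distractor mug instead of target")
print(trace.mode.value)  # → identification
```

Typing the mode as an enum is what makes traces regression-comparable: two checkpoints can be diffed by failure-mode histogram rather than by free-text notes.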
§ 03

Eight custom tasks.
Mapped to academic dimensions.

Physical bench A · B
16-camera capture · 120 Hz proprio
10 VLATest operators per rollout
T-01

Cluttered Pantry

Perception Reasoning · RoboBench / Mesh & Texture · VLABench

Retrieve a specified item from a densely populated shelf containing visually similar distractors. Adversarial confounders, paraphrased instructions, drifting lighting between rollouts.

Manipulation · Perception · Confounders
T-02

Translate-and-Fold

Instruction Comprehension · RoboBench / Semantic Adaptation · VLABench T4

Receive a paraphrased natural-language instruction (5 mutations per rollout, mixed languages) and fold a deformable object accordingly. Probes language grounding and instruction robustness.

Language · Deformables · Paraphrase
T-03

Cafeteria Tray

Generalized Planning · RoboBench / Long-Horizon · VLABench T6

Plan and execute a 9-step assembly: open dispensers, portion, plate, garnish, serve. Composite long-horizon — maps directly to VLABench Track-6 (where SR for π0-Fast peaks at 1.6%).

Long-horizon · Multi-skill · Tool-use
T-04

Disturbed Pour

Failure Analysis · RoboBench / Reactive Control · ARC

Pour granular media into a vessel while an adversarial human nudges the workspace. Online replanning, force-aware control, recovery from contact failure.

Recovery · Granular · Reactive
T-05

OOD Glassware

Cross-object · VLABench T2 / Perturbation · VLATest RQ3-5

Manipulate fragile transparent items absent from training distribution. Camera-pose, lighting, and object-mesh perturbations sweep the VLATest fuzzing operators.

OOD · Force-aware · Fuzzing
T-06

Partner-Handoff

Dynamic Affordance · RoboBench / Safety Envelope · ARC

Two arms — one human, one robotic — collaborate to pass and assemble pieces. Intent inference, timing, dynamic affordance, safety envelope under shared workspace.

HRI · Bimanual · Timing
T-07

Map-Free Navigation

Navigation Affordance · RoboBench / Spatial · VLABench

Navigate an unmapped office to fetch an object specified only by referring expression ("the blue mug Maria left near the window"). Spatial language + memory-driven exploration.

Spatial · Memory · Language
T-08

Tool-Improvise

Common Sense · VLABench T3 / Physical Law · VLABench

Intended tool is missing. Reach the goal using a non-canonical substitute (a ruler in place of a spatula). Tests creative reuse and physical-law reasoning.

Reasoning · Tool-use · Improvisation
§ 04

Per-task score matrix.

Mean of 32 rollouts
Heat: relative-to-best
Model · Lab | T-01 | T-02 | T-03 | T-04 | T-05 | T-06 | T-07 | T-08
Helix-2 · Figure | 33.0 | 32.1 | 34.9 | 45.8 | 55.5 | 51.5 | 53.6 | 41.7
π0.5 · Physical Intelligence | 45.9 | 45.6 | 44.6 | 48.7 | 49.8 | 38.2 | 35.2 | 22.3
Gemini-Robotics 1.5 · Google DeepMind | 49.4 | 40.5 | 31.9 | 30.9 | 31.0 | 22.5 | 26.2 | 21.7
GR00T N1.5 · NVIDIA | 32.6 | 22.7 | 17.2 | 22.9 | 31.4 | 30.9 | 39.9 | 37.0
RoboBrain-2.0 · BAAI | 22.7 | 21.2 | 23.7 | 34.7 | 44.8 | 41.6 | 44.4 | 33.2
π0-Fast · Physical Intelligence | 35.6 | 35.7 | 35.5 | 40.3 | 42.1 | 30.8 | 27.7 | 14.4
OpenVLA-7B · Stanford / TRI | 38.8 | 30.6 | 22.3 | 21.2 | 20.9 | 11.8 | 14.6 | 9.5
RDT-1B · Tsinghua | 21.3 | 11.0 | 8.0 | 9.7 | 17.6 | 16.7 | 25.8 | 23.3
Octo-base · Berkeley AI | 8.0 | 8.0 | 8.0 | 18.3 | 28.8 | 26.2 | 29.7 | 19.2
[ Best per column highlighted · normalized 0–100 ]
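The "relative-to-best" heat shading reduces to per-column normalization: each task column is scaled so its best score maps to 1.0. A minimal sketch; the rounding and shading details are assumptions.

```python
def heat_relative_to_best(column: list) -> list:
    """Scale a score column so the best entry maps to 1.0 (heat-shading sketch)."""
    best = max(column)
    return [round(v / best, 3) for v in column]

# T-01 scores for the top three leaderboard models:
print(heat_relative_to_best([33.0, 45.9, 49.4]))  # → [0.668, 0.929, 1.0]
```

Normalizing per column rather than per row keeps a uniformly hard task (such as T-08) from washing out in the heat map.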
§ 05

Academic precedent.
Three benchmarks. We compose them.

Independent · peer-reviewed
Cited verbatim
Validated against 20+ annotators
P/01 · arXiv:2510.17801 · Oct 2025

RoboBench

An MLLM-as-Embodied-Brain Evaluation Benchmark

PKU · BAAI · Fudan · USTB · Beijing Humanoid

System-2 cognition. 5 dimensions, 14 capabilities, 25 tasks, 6,092 QA pairs. Plans rolled out through an MLLM-simulated world with human-annotated state DAGs.

Best perception (Gemini-2.5-Pro)
62.96
Best generalized planning
39.33
Failure-analysis (best model)
45.14
Human ceiling (planning)
69.83

Frontier MLLMs trail human cognition by 10–35 points across embodied dimensions.

P/02 · ICCV 2025 · Fudan · OpenMOSS

VLABench

Long-Horizon Language-Conditioned Manipulation

Zhang et al.

System-1 execution. 100 task categories (60 primitive + 40 composite), 2,000+ 3D objects, MuJoCo. Six tracks measuring skill, generalization, semantic adaptation, long-horizon composition.

π0-Base · Track 1 SR
13.2
π0-Fast · Track 6 SR
1.6
Cross-domain (Track 5) SR
0.0
Difficulty Level vs LIBERO
75.96 / 17.96

State-of-the-art VLAs score in the single digits on real generalization.

P/03 · FSE 2025 · U. Alberta · U. Tokyo

VLATest

Testing & Evaluating Vision-Language-Action Models

Wang, Zhou, Song, Huang, Shu, Ma

Robustness fuzzing. 18,604 generated scenes, 78,604 rollouts, 580+ GPU hours, ten testing operators across object configuration, lighting, camera pose, instruction phrasing.

Models tested (RT / Octo / OpenVLA)
7
Lighting/camera robustness
drops
Unseen-object generalization
weak
Paraphrase robustness
fragile

Pretrained VLAs lack the robustness necessary for practical deployment.

[ ARC composes RoboBench (brain) + VLABench (policy) + VLATest (robust) and adds an industrial axis — cycle time, MTBF, safety envelope, regulatory-grade audit trail — that no academic benchmark covers. ]
§ 06

Why now.
Three cohorts. One score they can all defend.

VLAs are scaling faster
than internal benches calibrate.
ARC is the third-party check.
C/01 · VLA labs

Independent calibration before customer ship.

Internal benches drift. Customers don't trust your numbers. ARC publishes a signed score, on hardware you don't own, that your sales team can put in a deck the next morning.

Submit-to-score SLA
≤ 72 h
C/02 · Robot OEMs · fleet operators

Apples-to-apples on the workcell you actually buy.

Pick a candidate model. We run it on your reference workcell SKU under our perturbation envelope and hand back a defensible procurement-grade comparison — not a vendor pitch deck.

Tasks · rollouts each
8 / 32
C/03 · Insurers · regulators

An auditable rollout trail for safety review.

Every episode is logged at 30 Hz, signed, and replayable from raw observation. Safety envelopes (force, reach, exclusion zones) are scored as first-class metrics — not an afterthought.

Signed log rate
30 Hz
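Signing an episode log so it is tamper-evident yet replayable can be sketched with an HMAC over the canonical serialization of the frame stream. The key handling and frame schema here are assumptions; a real deployment would manage keys in dedicated hardware.

```python
import hashlib
import hmac
import json

SECRET = b"bench-signing-key"  # hypothetical; stands in for a properly managed key

def sign_episode(frames: list) -> str:
    """HMAC-SHA256 over the canonical JSON serialization of a 30 Hz frame list."""
    payload = json.dumps(frames, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

frames = [{"t": i / 30.0, "force_n": 0.0} for i in range(3)]  # three frames at 30 Hz
sig = sign_episode(frames)
print(sig)
# Verification on replay: recompute from raw observations, compare in constant time.
assert hmac.compare_digest(sig, sign_episode(frames))
```

Because the signature is computed from the raw observations themselves, any edit to a logged frame changes the digest, which is what makes the trail usable in a safety review.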
[ ARC is operated as an independent lab. We do not sell models, we do not take payment from labs whose models we score, and every protocol revision is published before the next rollout cycle. ]

State-of-the-art VLAs score in the single digits on
generalized long-horizon manipulation. ICCV said it.
FSE said it. We just keep score.

VLABench Track-6 SR — π0-Fast: 1.6 / OpenVLA: 0.4 / Octo: 0.0