Methodology.
How ARC-vla evaluates robot intelligence — without simulators, without cherry-picked episodes.
Overview.
We score what ships, not what trends.
ARC-vla measures vision-language-action models on physical rollouts across two calibrated benches. Every model receives the same eight tasks under the same ten perturbation operators, with execution traces, failure DAGs, and full statistical records preserved end-to-end.
Scoring is multi-dimensional — cognition, execution, robustness, and industrial readiness are graded independently and aggregated into a single composite. The protocol is open; any lab can replicate the bench from the published hardware reference.
Evaluation pipeline.
Model submission
Authors submit a checkpoint and inference container. ARC pins the build, hashes the weights, and records the submission against the open protocol version.
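As a concrete illustration, here is a minimal sketch of what pinning a submission could look like. The `Submission` fields, names, and helper are hypothetical, not the ARC schema; only the practice of hashing the weights and pinning the container build comes from the protocol description above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class Submission:
    """Illustrative submission record; field names are assumptions, not the ARC schema."""
    model_name: str
    container_digest: str  # pinned inference-container build (e.g. an OCI image digest)
    weights_sha256: str    # content hash of the checkpoint
    protocol_version: str  # open protocol version the run is recorded against

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a checkpoint so large weight files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

record = Submission(
    model_name="example-vla",                            # hypothetical model name
    container_digest="sha256:0d9f",                      # hypothetical digest
    weights_sha256=hashlib.sha256(b"demo").hexdigest(),  # sha256_of_file("checkpoint.pt") in practice
    protocol_version="1.0",
)
print(json.dumps(asdict(record), indent=2))
```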
Physical rollout
The model runs eight tasks on calibrated hardware. Sixteen-camera capture and 120 Hz proprio streams are logged, along with the ten VLATest perturbation operators applied to each rollout.
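To make the logged bundle concrete, a minimal sketch of a per-rollout record follows. The field and stream names are illustrative assumptions, not the ARC trace schema; only the counts and the sample rate come from the bench description above.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutLog:
    """One rollout's capture bundle; field names are illustrative, not the ARC schema."""
    task_id: str               # one of the eight benchmark tasks
    camera_streams: list[str]  # sixteen synchronized camera feeds
    proprio_hz: int            # proprioception sample rate (120 Hz on the bench)
    perturbations: list[str] = field(default_factory=list)  # the ten operators applied

log = RolloutLog(
    task_id="task-03",  # hypothetical task identifier
    camera_streams=[f"cam-{i:02d}" for i in range(16)],
    proprio_hz=120,
)
assert len(log.camera_streams) == 16
```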
Scoring & aggregation
Each rollout is scored along brain, policy, robust, and industrial axes — then aggregated into the ARC composite and published with full failure traces.
Scoring dimensions.
Brain
What the model thinks before it acts. Plans, identifies, recovers, asks.
- Plan generation under partial information
- Failure-mode reasoning & recovery
- Object & affordance identification
Policy
What the body actually does on the bench. Success rate on long-horizon rollouts.
- Skill chaining & contact sequencing
- Cross-object generalization
- Bimanual & deformable manipulation
Robust
What survives when the world stops cooperating. Ten VLATest operators per rollout.
- Lighting & camera-pose drift
- Paraphrased instructions
- OOD object & texture swaps
Industrial
What ships. Cycle time, MTBF, safety envelope, audit trail.
- Mean time between failures
- Cycle-time variance under load
- Safety-envelope adherence & audit log
Composite score.
Policy carries the heaviest weight because execution is what eventually deploys. Brain and robust carry equal weight — reasoning without resilience cannot ship, and resilience without reasoning cannot recover. Industrial is intentionally smaller; it captures the floor, not the ceiling, and grows in importance as a model approaches deployment.
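The exact weights live in the published aggregation formula; the sketch below only illustrates the shape of the computation. The numeric weights are placeholders chosen to respect the ordering stated above (policy heaviest, brain and robust equal, industrial smallest) and are not the ARC values.

```python
# Placeholder weights: they follow the stated ordering
# (policy > brain = robust > industrial) but are NOT the published ARC values.
WEIGHTS = {"policy": 0.40, "brain": 0.25, "robust": 0.25, "industrial": 0.10}

def arc_composite(scores: dict[str, float]) -> float:
    """Weighted sum of the four axis scores, each assumed to be on a 0-1 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(weight * scores[axis] for axis, weight in WEIGHTS.items())

print(arc_composite({"brain": 0.71, "policy": 0.64, "robust": 0.58, "industrial": 0.80}))
```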
Robustness fuzzing.
Each rollout perturbs the environment along ten operators. Memorization fails by construction — the policy must generalize across lighting, paraphrase, OOD objects, sensor dropout, and timing skew.
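One way to read "fails by construction": the operator schedule and magnitudes are derived deterministically per rollout, so reruns replay exactly while no single memorized scene ever recurs. The sketch below assumes hypothetical operator names and a seeded-draw scheme; it illustrates the idea, not the VLATest implementation.

```python
import hashlib
import random

# Hypothetical names standing in for the ten perturbation operators.
OPERATORS = [
    "lighting_drift", "camera_pose_drift", "paraphrase", "ood_object",
    "ood_texture", "sensor_dropout", "timing_skew", "distractor_clutter",
    "initial_pose_jitter", "force_disturbance",
]

def perturbation_schedule(rollout_id: str, protocol_seed: int = 0) -> list[tuple[str, float]]:
    """Every operator fires on every rollout; its magnitude is a seeded draw,
    deterministic for a given (seed, rollout) pair so runs replay exactly."""
    digest = hashlib.sha256(f"{protocol_seed}:{rollout_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [(op, round(rng.uniform(0.0, 1.0), 3)) for op in OPERATORS]

print(perturbation_schedule("task-03/run-007"))
```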
Reproducibility.
- Spec, scoring rubric, perturbation operators, and aggregation formula are all published. Versioned, peer-readable, and stable.
- Two calibrated benches with full BOM, controller firmware versions, capture rig, and calibration jigs. Replicable on commodity hardware.
- Rollout videos, proprio streams, and failure-DAG annotations are public for every scored model. Submitters can verify their own runs.
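Because every scored model's artifacts are public, a submitter-side check reduces to recomputing hashes against a published manifest. The manifest layout and the function below are assumptions for illustration, not the ARC distribution format.

```python
import hashlib
import json

def verify_bundle(manifest_path: str) -> bool:
    """Recompute each artifact's hash and compare it with the published value.
    Assumes a manifest shaped like {"files": {"<path>": "<sha256>"}}, which is
    an illustrative layout, not the ARC distribution format."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    for path, expected in manifest["files"].items():
        h = hashlib.sha256()
        with open(path, "rb") as artifact:
            for chunk in iter(lambda: artifact.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            return False
    return True
```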
Failure philosophy.
Failure is signal.
Every run emits a typed failure trace. Execution failures, identification failures, common-sense failures, and mode-specific failures are tracked separately — so a model that recovers gracefully is not penalized as if it had never tried.
Failure traces are auditable, regression-comparable across cycles, and deployment-grade. They are the primary product of the benchmark — the leaderboard is just an index.
- Execution · contact, force, trajectory
- Identification · object, affordance, location
- Common-sense · physical law, social context
- Mode-specific · paraphrase, OOD, sensor dropout
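A minimal sketch of how such a typed trace could be represented, so that a recovered failure stays distinguishable from an abort. The type and field names are assumptions, not the published trace schema.

```python
from dataclasses import dataclass
from enum import Enum

class FailureType(Enum):
    EXECUTION = "execution"            # contact, force, trajectory
    IDENTIFICATION = "identification"  # object, affordance, location
    COMMON_SENSE = "common_sense"      # physical law, social context
    MODE_SPECIFIC = "mode_specific"    # paraphrase, OOD, sensor dropout

@dataclass
class FailureEvent:
    """One node of a rollout's failure DAG; field names are illustrative."""
    kind: FailureType
    step: int        # timestep at which the failure surfaced
    recovered: bool  # recovered failures score differently from aborts
    detail: str

trace = [FailureEvent(FailureType.EXECUTION, step=412, recovered=True,
                      detail="grasp slip, regrasped")]
# A graceful recovery is not counted as if the model had never tried.
aborts = [event for event in trace if not event.recovered]
```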