ARC·vla
§ 02 · Methodology

Methodology.

How ARC-vla evaluates robot intelligence — without simulators, without cherry-picked episodes.

Spec v0.4.2
Open protocol
CC-BY-SA
§ 02.1

Overview.

Real hardware
Calibrated scoring
Continuous re-evaluation

We score what ships,
not what trends.

ARC-vla measures vision-language-action models on physical rollouts across two calibrated benches. Every model receives the same eight tasks under the same ten perturbation operators, with execution traces, failure DAGs, and statistical fidelity preserved end-to-end.

Scoring is multi-dimensional — cognition, execution, robustness, and industrial readiness are graded independently and aggregated into a single composite. The protocol is open; any lab can replicate the bench from the published hardware reference.

§ 02.2

Evaluation pipeline.

Three stages
Reproducible
Audit-traceable
STEP 01

Model submission

Authors submit a checkpoint and inference container. ARC pins the build, hashes the weights, and records the submission against the open protocol version.
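Pinning a submission can be sketched in a few lines. This is an illustrative sketch, not the published ARC tooling; the function name, record fields, and the use of SHA-256 over the raw weight file are assumptions.

```python
import hashlib
from pathlib import Path

def register_submission(checkpoint: Path, container_digest: str,
                        protocol_version: str = "0.4.2") -> dict:
    """Hash the checkpoint weights and pin the inference container
    so the exact submission can be replayed and audited later."""
    sha = hashlib.sha256()
    with open(checkpoint, "rb") as f:
        # Stream in 1 MiB chunks so large checkpoints don't load into RAM.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return {
        "weights_sha256": sha.hexdigest(),
        "container_digest": container_digest,  # e.g. a pinned OCI image digest
        "protocol_version": protocol_version,
    }
```

The record ties weights, container, and protocol version together, so a later replication can detect any drift in the build.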

STEP 02

Physical rollout

The model runs eight tasks on calibrated hardware. Sixteen-camera capture and 120 Hz proprio are logged alongside ten VLATest perturbation operators per rollout.
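A rollout record with the ingredients above might look like the following sketch. The field names and structure are illustrative assumptions, not the published log schema; only the counts (sixteen cameras, 120 Hz proprioception, ten operators) come from the protocol.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutLog:
    """One physical rollout as the bench records it (illustrative schema)."""
    task_id: str                     # one of the eight benchmark tasks
    camera_frames: dict              # 16 streams, keyed by camera id
    proprio_hz: int = 120            # proprioception sample rate
    perturbations: list = field(default_factory=list)  # ten operators per rollout

log = RolloutLog(task_id="task-03", camera_frames={})
log.perturbations.append("OP-05")    # instruction paraphrase applied this rollout
```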

STEP 03

Scoring & aggregation

Each rollout is scored along the Brain, Policy, Robust, and Industrial axes, then aggregated into the ARC composite and published with full failure traces.

§ 02.3

Scoring dimensions.

Four axes
Independently graded
0–100 native
Brain
System-2 cognition

What the model thinks before it acts. Plans, identifies, recovers, asks.

  • Plan generation under partial information
  • Failure-mode reasoning & recovery
  • Object & affordance identification
Policy
System-1 execution

What the body actually does on the bench. Success rate on long-horizon rollouts.

  • Skill chaining & contact sequencing
  • Cross-object generalization
  • Bimanual & deformable manipulation
Robust
Perturbation resilience

What survives when the world stops cooperating. Ten VLATest operators per rollout.

  • Lighting & camera-pose drift
  • Paraphrased instructions
  • OOD object & texture swaps
Industrial
Real-world readiness

What ships. Cycle time, MTBF, safety envelope, audit trail.

  • Mean time between failures
  • Cycle-time variance under load
  • Safety-envelope adherence & audit log
§ 02.4

Composite score.

Weighted aggregation
Disclosed in protocol
ARC = 0.25·brain + 0.35·policy + 0.25·robust + 0.15·industrial
  • Brain · 0.25
  • Policy · 0.35
  • Robust · 0.25
  • Industrial · 0.15

Policy carries the heaviest weight because execution is what eventually deploys. Brain and robust carry equal weight — reasoning without resilience cannot ship, and resilience without reasoning cannot recover. Industrial is intentionally smaller; it captures the floor, not the ceiling, and grows in importance as a model approaches deployment.
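The disclosed formula is a straight weighted sum; a minimal sketch, with function and dictionary names as assumptions:

```python
# Disclosed aggregation weights from § 02.4; axis scores are 0-100 native.
WEIGHTS = {"brain": 0.25, "policy": 0.35, "robust": 0.25, "industrial": 0.15}

def arc_composite(scores: dict) -> float:
    """ARC = 0.25*brain + 0.35*policy + 0.25*robust + 0.15*industrial."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)

arc_composite({"brain": 80, "policy": 70, "robust": 60, "industrial": 90})
# 0.25*80 + 0.35*70 + 0.25*60 + 0.15*90 = 73.0
```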

§ 02.5

Robustness fuzzing.

Ten operators
Per rollout
VLATest-aligned

Each rollout perturbs the environment along ten operators. Memorization fails by construction — the policy must generalize across lighting, paraphrase, OOD objects, sensor dropout, and timing skew.

OP-01
Lighting
Color temperature, intensity, and direction sweep across rollouts.
OP-02
Camera pose
±15° yaw/pitch and translation offsets on the primary RGB head.
OP-03
Object texture
Albedo and roughness substitutions on training-set distractors.
OP-04
Object mesh
Shape variants and unseen instance-level meshes inserted at runtime.
OP-05
Instruction paraphrase
Five paraphrases per rollout, mixed languages where applicable.
OP-06
Adversarial confounders
Visually similar distractors placed at the goal location.
OP-07
Workspace drift
Table-height and origin offsets between trial blocks.
OP-08
Disturbance
Human-in-the-loop nudges and goal swaps mid-rollout.
OP-09
Sensor dropout
Partial observation: occluded camera or proprioception lag.
OP-10
Timing skew
Inference-latency and control-loop jitter injected on the policy bus.
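Sampling the ten operators deterministically per seed makes each perturbed rollout replayable. The sketch below is an assumption about how such a sampler could look; the parameter ranges are illustrative except where the operator descriptions above state them (±15° for camera pose, five paraphrases).

```python
import random

def perturb_rollout(seed: int) -> dict:
    """One concrete setting per operator, deterministic per seed,
    so a rollout can be replayed exactly. Ranges are illustrative."""
    rng = random.Random(seed)
    return {
        "OP-01": {"color_temp_k": rng.uniform(3000, 6500)},   # lighting
        "OP-02": {"yaw_deg": rng.uniform(-15, 15),
                  "pitch_deg": rng.uniform(-15, 15)},         # camera pose
        "OP-03": {"texture_swap": rng.choice([True, False])}, # object texture
        "OP-04": {"mesh_variant": rng.randrange(4)},          # object mesh
        "OP-05": {"paraphrase_idx": rng.randrange(5)},        # instruction
        "OP-06": {"confounder_at_goal": True},                # adversarial
        "OP-07": {"table_dz_mm": rng.uniform(-20, 20)},       # workspace drift
        "OP-08": {"nudge_at_s": rng.uniform(1.0, 10.0)},      # disturbance
        "OP-09": {"dropout_camera": rng.randrange(16)},       # sensor dropout
        "OP-10": {"latency_jitter_ms": rng.uniform(0, 50)},   # timing skew
    }
```

Because every rollout draws from all ten operators at once, a policy that memorized the training distribution has nothing stable to memorize.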
§ 02.6

Reproducibility.

Open protocol
Public hardware ref
CC-BY-SA
Open
Protocol

Spec, scoring rubric, perturbation operators, and aggregation formula are all published. Versioned, peer-readable, and stable.

Ref
Hardware

Two calibrated benches with full BOM, controller firmware versions, capture rig, and calibration jigs. Replicable on commodity hardware.

Trace
Replays

Rollout videos, proprio streams, and failure-DAG annotations are public for every scored model. Submitters can verify their own runs.

§ 02.7

Failure philosophy.

Typed traces
Regression-comparable

Failure
is signal.

Every run emits a typed failure trace. Execution failures, identification failures, common-sense failures, and mode-specific failures are tracked separately — so a model that recovers gracefully is not penalized as if it had never tried.

Failure traces are auditable, regression-comparable across cycles, and deployment-grade. They are the primary product of the benchmark — the leaderboard is just an index.

  • Execution · contact, force, trajectory
  • Identification · object, affordance, location
  • Common-sense · physical law, social context
  • Mode-specific · paraphrase, OOD, sensor dropout
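The four failure types map naturally onto a typed trace. A minimal sketch, assuming names of my own choosing (`FailureKind`, `FailureEvent`, `regression_diff` are not from the protocol):

```python
from dataclasses import dataclass
from enum import Enum

class FailureKind(Enum):
    EXECUTION = "execution"            # contact, force, trajectory
    IDENTIFICATION = "identification"  # object, affordance, location
    COMMON_SENSE = "common_sense"      # physical law, social context
    MODE_SPECIFIC = "mode_specific"    # paraphrase, OOD, sensor dropout

@dataclass(frozen=True)
class FailureEvent:
    kind: FailureKind
    detail: str
    recovered: bool  # graceful recovery is recorded, not scored as a no-attempt

def regression_diff(prev: list, curr: list) -> set:
    """Failure kinds present in the current cycle but absent last cycle:
    the signal a regression comparison looks for."""
    return {e.kind for e in curr} - {e.kind for e in prev}
```

Because events are typed rather than free-text, traces from different evaluation cycles can be diffed mechanically, which is what makes them regression-comparable.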