Methodology.
How ARC-vla evaluates robot intelligence — without simulators, without cherry-picked episodes.
Overview.
We score what ships, not what trends.
ARC-vla measures vision-language-action models on physical rollouts across two calibrated benches. Every model receives the same eight tasks under the same ten perturbation operators, with execution traces, failure DAGs, and full statistical records preserved end-to-end.
Scoring is multi-dimensional — cognition, execution, robustness, and industrial readiness are graded independently and aggregated into a single composite. The protocol is open; any lab can replicate the bench from the published hardware reference.
Evaluation pipeline.
Model submission
Authors submit a checkpoint and inference container. ARC pins the build, hashes the weights, and records the submission against the open protocol version.
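As a concrete illustration, here is a minimal sketch of what pinning a submission could look like. The `Submission` fields, names, and helper are hypothetical, not the ARC schema; only the practice of hashing the weights and pinning the container build comes from the protocol description above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class Submission:
    """Illustrative submission record; field names are assumptions, not the ARC schema."""
    model_name: str
    container_digest: str  # pinned inference-container build (e.g. an OCI image digest)
    weights_sha256: str    # content hash of the checkpoint
    protocol_version: str  # open protocol version the run is recorded against

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a checkpoint so large weight files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

record = Submission(
    model_name="example-vla",                            # hypothetical model name
    container_digest="sha256:0d9f",                      # hypothetical digest
    weights_sha256=hashlib.sha256(b"demo").hexdigest(),  # sha256_of_file("checkpoint.pt") in practice
    protocol_version="1.0",
)
print(json.dumps(asdict(record), indent=2))
```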
Physical rollout
The model runs eight tasks on calibrated hardware. Sixteen-camera capture and 120 Hz proprio streams are logged, along with the ten VLATest perturbation operators applied to each rollout.
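To make the logged bundle concrete, a minimal sketch of a per-rollout record follows. The field and stream names are illustrative assumptions, not the ARC trace schema; only the counts and the sample rate come from the bench description above.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutLog:
    """One rollout's capture bundle; field names are illustrative, not the ARC schema."""
    task_id: str               # one of the eight benchmark tasks
    camera_streams: list[str]  # sixteen synchronized camera feeds
    proprio_hz: int            # proprioception sample rate (120 Hz on the bench)
    perturbations: list[str] = field(default_factory=list)  # the ten operators applied

log = RolloutLog(
    task_id="task-03",  # hypothetical task identifier
    camera_streams=[f"cam-{i:02d}" for i in range(16)],
    proprio_hz=120,
)
assert len(log.camera_streams) == 16
```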
Scoring & aggregation
Each rollout is scored along brain, policy, robust, and industrial axes — then aggregated into the ARC composite and published with full failure traces.
Scoring dimensions.
Brain
What the model thinks before it acts. Plans, identifies, recovers, asks.
- Plan generation under partial information
- Failure-mode reasoning & recovery
- Object & affordance identification
Policy
What the body actually does on the bench. Success rate on long-horizon rollouts.
- Skill chaining & contact sequencing
- Cross-object generalization
- Bimanual & deformable manipulation
Robust
What survives when the world stops cooperating. Ten VLATest operators per rollout.
- Lighting & camera-pose drift
- Paraphrased instructions
- OOD object & texture swaps
Industrial
What ships. Cycle time, MTBF, safety envelope, audit trail.
- Mean time between failures
- Cycle-time variance under load
- Safety-envelope adherence & audit log
Composite score.
Policy carries the heaviest weight because execution is what eventually deploys. Brain and robust carry equal weight — reasoning without resilience cannot ship, and resilience without reasoning cannot recover. Industrial is intentionally smaller; it captures the floor, not the ceiling, and grows in importance as a model approaches deployment.
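The exact weights live in the published aggregation formula; the sketch below only illustrates the shape of the computation. The numeric weights are placeholders chosen to respect the ordering stated above (policy heaviest, brain and robust equal, industrial smallest) and are not the ARC values.

```python
# Placeholder weights: they follow the stated ordering
# (policy > brain = robust > industrial) but are NOT the published ARC values.
WEIGHTS = {"policy": 0.40, "brain": 0.25, "robust": 0.25, "industrial": 0.10}

def arc_composite(scores: dict[str, float]) -> float:
    """Weighted sum of the four axis scores, each assumed to be on a 0-1 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(weight * scores[axis] for axis, weight in WEIGHTS.items())

print(arc_composite({"brain": 0.71, "policy": 0.64, "robust": 0.58, "industrial": 0.80}))
```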
Robustness fuzzing.
Each rollout perturbs the environment along ten operators. Memorization fails by construction — the policy must generalize across lighting, paraphrase, OOD objects, sensor dropout, and timing skew.
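One way to read "fails by construction": the operator schedule and magnitudes are derived deterministically per rollout, so reruns replay exactly while no single memorized scene ever recurs. The sketch below assumes hypothetical operator names and a seeded-draw scheme; it illustrates the idea, not the VLATest implementation.

```python
import hashlib
import random

# Hypothetical names standing in for the ten perturbation operators.
OPERATORS = [
    "lighting_drift", "camera_pose_drift", "paraphrase", "ood_object",
    "ood_texture", "sensor_dropout", "timing_skew", "distractor_clutter",
    "initial_pose_jitter", "force_disturbance",
]

def perturbation_schedule(rollout_id: str, protocol_seed: int = 0) -> list[tuple[str, float]]:
    """Every operator fires on every rollout; its magnitude is a seeded draw,
    deterministic for a given (seed, rollout) pair so runs replay exactly."""
    digest = hashlib.sha256(f"{protocol_seed}:{rollout_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [(op, round(rng.uniform(0.0, 1.0), 3)) for op in OPERATORS]

print(perturbation_schedule("task-03/run-007"))
```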
Reproducibility.
- Spec, scoring rubric, perturbation operators, and aggregation formula are all published. Versioned, peer-readable, and stable.
- Two calibrated benches with full BOM, controller firmware versions, capture rig, and calibration jigs. Replicable on commodity hardware.
- Rollout videos, proprio streams, and failure-DAG annotations are public for every scored model. Submitters can verify their own runs.
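Because every scored model's artifacts are public, a submitter-side check reduces to recomputing hashes against a published manifest. The manifest layout and the function below are assumptions for illustration, not the ARC distribution format.

```python
import hashlib
import json

def verify_bundle(manifest_path: str) -> bool:
    """Recompute each artifact's hash and compare it with the published value.
    Assumes a manifest shaped like {"files": {"<path>": "<sha256>"}}, which is
    an illustrative layout, not the ARC distribution format."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    for path, expected in manifest["files"].items():
        h = hashlib.sha256()
        with open(path, "rb") as artifact:
            for chunk in iter(lambda: artifact.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            return False
    return True
```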
Failure philosophy.
Failure is signal.
Every run emits a typed failure trace. Execution failures, identification failures, common-sense failures, and mode-specific failures are tracked separately — so a model that recovers gracefully is not penalized as if it had never tried.
Failure traces are auditable, regression-comparable across cycles, and deployment-grade. They are the primary product of the benchmark — the leaderboard is just an index.
- Execution · contact, force, trajectory
- Identification · object, affordance, location
- Common-sense · physical law, social context
- Mode-specific · paraphrase, OOD, sensor dropout
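A minimal sketch of how such a typed trace could be represented, so that a recovered failure stays distinguishable from an abort. The type and field names are assumptions, not the published trace schema.

```python
from dataclasses import dataclass
from enum import Enum

class FailureType(Enum):
    EXECUTION = "execution"            # contact, force, trajectory
    IDENTIFICATION = "identification"  # object, affordance, location
    COMMON_SENSE = "common_sense"      # physical law, social context
    MODE_SPECIFIC = "mode_specific"    # paraphrase, OOD, sensor dropout

@dataclass
class FailureEvent:
    """One node of a rollout's failure DAG; field names are illustrative."""
    kind: FailureType
    step: int        # timestep at which the failure surfaced
    recovered: bool  # recovered failures score differently from aborts
    detail: str

trace = [FailureEvent(FailureType.EXECUTION, step=412, recovered=True,
                      detail="grasp slip, regrasped")]
# A graceful recovery is not counted as if the model had never tried.
aborts = [event for event in trace if not event.recovered]
```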