P/01 · arXiv:2510.17801 · Oct 2025
RoboBench
An MLLM-as-Embodied-Brain Evaluation Benchmark
PKU · BAAI · Fudan · USTB · Beijing Humanoid
System-2 cognition. 5 dimensions, 14 capabilities, 25 tasks, 6,092 QA pairs. Plans are rolled out through an MLLM-simulated world with human-annotated state DAGs (see the precedence-check sketch below).
- Best perception (Gemini-2.5-Pro): 62.96
- Best generalized planning: 39.33
- Failure analysis (best model): 45.14
- Human ceiling (planning): 69.83
▍Frontier MLLMs trail human cognition by 10–35 points across embodied dimensions.
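For concreteness, here is a minimal Python sketch of the kind of precedence check an annotated state DAG enables: an edge (u, v) means step u must complete before step v. The function and step names are hypothetical illustrations, not RoboBench's actual evaluation code.

```python
# Hypothetical sketch: verify a model's predicted plan against a
# human-annotated precedence DAG. Names are illustrative only.

def plan_respects_dag(plan: list[str], edges: list[tuple[str, str]]) -> bool:
    """Return True iff the plan contains every step referenced by the
    DAG and orders each prerequisite before its dependent."""
    position = {step: i for i, step in enumerate(plan)}
    for u, v in edges:
        if u not in position or v not in position:
            return False  # a required step is missing from the plan
        if position[u] >= position[v]:
            return False  # prerequisite scheduled too late
    return True

# Example: the cabinet must be opened before the mug can be picked.
edges = [("open_cabinet", "pick_mug"), ("pick_mug", "place_mug")]
assert plan_respects_dag(["open_cabinet", "pick_mug", "place_mug"], edges)
assert not plan_respects_dag(["pick_mug", "open_cabinet", "place_mug"], edges)
```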
P/02 · ICCV 2025 · Fudan · OpenMOSS
VLABench
Long-Horizon Language-Conditioned Manipulation
Zhang et al.
System-1 execution. 100 task categories (60 primitive + 40 composite), 2,000+ 3D objects, built in MuJoCo. Six tracks measure skill, generalization, semantic adaptation, and long-horizon composition (per-track success rates aggregated as sketched below).
- π0-Base · Track 1 SR: 13.2%
- π0-Fast · Track 6 SR: 1.6%
- Cross-domain (Track 5) SR: 0.0%
- Difficulty level vs. LIBERO: 75.96 / 17.96
▍State-of-the-art VLAs score in the single digits on real generalization.
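As a reading aid, here is a short sketch of how per-track success rates like those above are conventionally aggregated (SR = 100 × successes / episodes). The flat (track, success) record format is an assumption, not VLABench's actual output schema.

```python
# Hypothetical aggregation: compute per-track success rate (SR, in %)
# from a flat list of rollout records.
from collections import defaultdict

def success_rates(rollouts: list[dict]) -> dict[str, float]:
    """Map each track name to 100 * successes / episodes."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in rollouts:
        totals[r["track"]] += 1
        wins[r["track"]] += int(r["success"])
    return {t: 100.0 * wins[t] / totals[t] for t in totals}

rollouts = [
    {"track": "track1_skill", "success": True},
    {"track": "track1_skill", "success": False},
    {"track": "track6_composite", "success": False},
]
print(success_rates(rollouts))  # {'track1_skill': 50.0, 'track6_composite': 0.0}
```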
P/03 · FSE 2025 · U. Alberta · U. Tokyo
VLATest
Testing & Evaluating Vision-Language-Action Models
Wang, Zhou, Song, Huang, Shu, Ma
Robustness fuzzing. 18,604 generated scenes, 78,604 rollouts, 580+ GPU hours, ten testing operators spanning object configuration, lighting, camera pose, and instruction phrasing (one such operator sketched after this panel).
- Models tested (RT / Octo / OpenVLA): 7
- Lighting/camera robustness: drops
- Unseen-object generalization: weak
- Paraphrase robustness: fragile
▍Pretrained VLAs lack the robustness necessary for practical deployment.
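To make the fuzzing idea concrete, here is a minimal sketch of one scene-mutation operator in the spirit of VLATest's object-configuration perturbations. The scene schema, field names, and function name are assumptions for illustration, not the paper's API.

```python
# Hypothetical operator: jitter distractor object poses while keeping
# the task-relevant object and the instruction fixed. A robust policy
# should still succeed on the mutated scene.
import random

def perturb_distractors(scene: dict, jitter: float = 0.05,
                        seed: int | None = None) -> dict:
    """Copy `scene`, shifting each non-target object's (x, y) position
    by uniform noise in [-jitter, +jitter] meters."""
    rng = random.Random(seed)
    mutated = {**scene, "objects": []}
    for obj in scene["objects"]:
        if obj.get("is_target"):
            mutated["objects"].append(dict(obj))  # leave the task object alone
        else:
            x, y = obj["position"]
            mutated["objects"].append({
                **obj,
                "position": (x + rng.uniform(-jitter, jitter),
                             y + rng.uniform(-jitter, jitter)),
            })
    return mutated

scene = {"instruction": "pick up the red cube",
         "objects": [{"name": "red_cube", "is_target": True, "position": (0.3, 0.1)},
                     {"name": "mug", "position": (0.5, -0.2)}]}
print(perturb_distractors(scene, seed=0)["objects"][1]["position"])
```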