ActionEQA: Action Interface for Embodied Question Answering

Overview

Abstract

While Vision-Language Models (VLMs) are increasingly integral to embodied intelligence, a significant action-understanding bottleneck persists in translating high-level semantic instructions into precise low-level physical actions. Current benchmarks for embodied agents focus primarily on high-level perception and planning and fail to capture the depth and nature of this semantic-to-physical gap.

To address this, we introduce ActionEQA, the first Embodied Question Answering (EQA) benchmark designed to methodically evaluate the ability of VLMs to bridge this critical yet underexplored semantic-physical divide. Grounded in real-world robotics data, ActionEQA thoroughly analyzes VLMs' grasp of the action interface using a dual-tier design: (1) a Three-Tiered Action Hierarchy for pinpointing the depth at which VLMs' action reasoning collapses, and (2) Bidirectional Reasoning Tasks for testing whether VLMs struggle more to predict action outcomes or infer the actions that led to them.

Our key findings reveal: (1) The primary bottleneck in action understanding occurs at the mid-level, arising from the challenge of grounding compositional language in 3D physical geometry. (2) VLMs are more adept at inferring past actions than predicting their future outcomes. (3) Richer visual inputs require greater spatial reasoning from VLMs to map actions to physical geometry. (4) Within the action hierarchy, model failures shift from predominantly perceptual errors at the high level to flawed geometric and physical reasoning at the low level.

Key Contributions

Research Highlights

Three-Tiered Action Hierarchy

Evaluates VLMs at High (task goals), Mid (semantic motion), and Low (7-DoF motor commands) levels — enabling pinpoint diagnosis of where action reasoning collapses across the semantic-to-physical spectrum.

Bidirectional Reasoning

State Prediction: given a starting frame + action, predict the goal state. Action Inference: given before/after frames, identify the action that caused the transition.

Real-World Robot Data

Grounded in three large-scale datasets: BridgeData V2, DROID, and RT-1, covering 8,795 questions, 26,213 unique images, and 2,629 unique episodes.

Mid-Level Bottleneck Discovery

Contrary to expectation, a V-shaped performance trend emerges: models struggle most at the mid level (semantic motion), not the low level. Paired t-tests across all 34 models confirm this (p < 0.001): mid-level accuracy is on average 18.93 pp below high-level and 8.66 pp below low-level.

Comprehensive 34-Model Evaluation

Evaluates 26 open-weights and 8 proprietary VLMs. The best model (Gemini-2.5-Pro) achieves 58.4% overall vs. 95.6% human performance, a gap of 37.2 pp that highlights how hard the task remains.

Reasoning > Perception

Richer visual inputs (more camera views) do not consistently help and can hurt advanced models. In single-view scenarios the egocentric view dominates, with a +23.8 pp advantage on H-AI (high-level Action Inference) tasks. The bottleneck is reasoning, not perception.

Benchmark

Dataset Overview

ActionEQA is a visual question-answering benchmark for robot action understanding, designed to probe VLMs across two task types and three action abstraction levels.

Total Questions: 8,795
Unique Images: 26,213
Unique Episodes: 2,629
Task Categories: 6
VLMs Evaluated: 34
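
The six task categories follow from crossing the three action levels with the two task directions. The sketch below shows one way per-category accuracy could be tallied. The record fields (`level`, `task`, `correct`) and the demo answers are hypothetical and are not the released schema or real results.

```python
from collections import defaultdict
from itertools import product

LEVELS = ("H", "M", "L")   # High, Mid, Low action levels
TASKS = ("SP", "AI")       # State Prediction, Action Inference

# The six task categories: H-SP, H-AI, M-SP, M-AI, L-SP, L-AI.
CATEGORIES = [f"{level}-{task}" for level, task in product(LEVELS, TASKS)]


def accuracy_by_category(records):
    """Return {category: accuracy in percent} for graded records.

    Each record is a dict with hypothetical keys:
    'level' in LEVELS, 'task' in TASKS, 'correct' as a bool.
    Categories with no records are omitted.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = f"{r['level']}-{r['task']}"
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {k: 100.0 * hits[k] / totals[k] for k in CATEGORIES if totals[k]}


# Made-up graded answers, for illustration only.
demo = [
    {"level": "H", "task": "AI", "correct": True},
    {"level": "H", "task": "AI", "correct": False},
    {"level": "M", "task": "SP", "correct": False},
]
print(CATEGORIES)
print(accuracy_by_category(demo))
```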

Evaluation

Benchmark Results

Accuracy (%) on State Prediction (SP) and Action Inference (AI) across High (H), Mid (M), and Low (L) action levels on DROID, BridgeData V2, and RT-1.


Analysis

Empirical Findings

Our evaluation of 34 VLMs on ActionEQA reveals four critical insights into the semantic-to-physical gap, from a surprising V-shaped bottleneck to a fundamental shift in error nature across the action hierarchy.

01

The Unexpected Mid-Level Bottleneck

Rather than a monotonic decline, VLMs exhibit a V-shaped performance profile — they struggle most with mid-level semantic motion descriptions, not low-level motor commands.

VLMs do not degrade monotonically from abstract to concrete actions. Paired t-tests across all 34 evaluated models confirm that the V-shaped bottleneck at the mid level is statistically significant: mid-level accuracy is 18.93 pp lower than high-level and 8.66 pp lower than low-level (both p < 0.001). The core challenge is grounding compositional language, e.g., "Move along positive X-axis, Rotate clockwise around Z-axis", in 3D physical geometry.
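
To make the relation between the low and mid levels concrete, the sketch below maps a 7-DoF delta command to a mid-level style description. It is an illustration only, not the benchmark's annotation procedure: the command layout (three translations, three rotations, one gripper value), the threshold `eps`, and the sign conventions are all assumptions.

```python
def describe_motion(delta, eps=1e-3):
    """Map a 7-DoF delta command to a mid-level style description.

    delta = (dx, dy, dz, droll, dpitch, dyaw, gripper). The layout, the
    threshold eps, and the sign conventions are assumptions: positive
    rotation is counter-clockwise (right-hand rule), gripper > 0 means
    open and gripper < 0 means close.
    """
    dx, dy, dz, droll, dpitch, dyaw, grip = delta
    parts = []
    # Translation components, one phrase per active axis.
    for name, value in (("X", dx), ("Y", dy), ("Z", dz)):
        if abs(value) > eps:
            sign = "positive" if value > 0 else "negative"
            parts.append(f"Move along {sign} {name}-axis")
    # Rotation components, one phrase per active axis.
    for name, value in (("X", droll), ("Y", dpitch), ("Z", dyaw)):
        if abs(value) > eps:
            turn = "counter-clockwise" if value > 0 else "clockwise"
            parts.append(f"Rotate {turn} around {name}-axis")
    if abs(grip) > eps:
        parts.append("Open gripper" if grip > 0 else "Close gripper")
    return ", ".join(parts) if parts else "No motion"


print(describe_motion((0.05, 0.0, 0.0, 0.0, 0.0, -0.3, 0.0)))
```

Under these assumed conventions the example command yields the composite phrase quoted above. Going the other way, from the phrase back to a plausible image or command, is the grounding step the mid-level questions test.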

Figure: Accuracy (%) by Action Level, Representative Models vs. Human Baseline. Bars show High-Level, Mid-Level (bottleneck), and Low-Level accuracy on a 0-100% axis.

Overall accuracy per level computed via hierarchical averaging across all applicable datasets.

Rotation: The Hardest Mid-Level Action

Within the mid-level bottleneck, rotational actions consistently yield the highest error rates across all view configurations — confirmed by paired t-test (p = 0.013, Cohen's d = 2.04 vs. translation). Gripper actions (open/close) are the easiest and even improve slightly with more views, while rotation and translation errors remain stubbornly high regardless of perceptual richness.

Mid-Level Action Type	1 View	2 Views	3 Views	4 Views	Status
Rotation	38.22%	39.35%	38.87%	39.76%	Highest & Persistent
Translation	34.95%	35.77%	35.64%	36.18%	No Improvement
Gripper	25.89%	24.91%	24.73%	24.28%	Benefits from More Views

Average normalized error rates (%) on BridgeData V2 by mid-level action type as exocentric views increase from 1 to 4.

02

VLMs Excel at Explaining the Past, Not Predicting the Future

A consistent asymmetry: Action Inference (retrospective) outperforms State Prediction (predictive) by 4.31 pp on average across all 34 models.

Model	Type	SP Avg	AI Avg	Δ (AI−SP)
Gemini-2.5-Pro	Prop.	53.2%	63.7%	+10.5% ↑
Gemini-2.5-Flash	Prop.	42.7%	51.2%	+8.5% ↑
GPT-4.1	Prop.	48.9%	44.0%	−4.9% ↓
Ovis2.5-9B	Open	35.3%	49.6%	+14.3% ↑
InternVL3-14B	Open	38.9%	45.5%	+6.7% ↑
GLM-4.5V	Open	48.0%	44.4%	−3.6% ↓
Human	—	96.2%	94.9%	−1.3%

SP/AI averages computed via hierarchical averaging across all applicable datasets and action levels. Δ = AI − SP; positive values indicate AI outperforms SP.

ActionEQA's bidirectional framework exposes a striking asymmetry in VLM reasoning: models find it systematically easier to infer what action caused an observed outcome (Action Inference) than to predict the visual result of a planned action (State Prediction).

This gap is statistically significant across all 34 evaluated models (p < 0.001, paired t-test), with Action Inference outperforming State Prediction by an average of 4.31 pp. The top model, Gemini-2.5-Pro, shows a large gap: 63.7% AI vs. 53.2% SP, a 10.5 pp deficit in forward predictive reasoning. Among the representative models in the table, only Ovis2.5-9B shows a larger one (+14.3 pp), while GPT-4.1 and GLM-4.5V run against the trend.
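
For readers unfamiliar with the test, the sketch below computes a paired t statistic on the six representative models from the table above, using only the standard library. It cannot reproduce the reported result, which uses all 34 models; with six rows it only illustrates the mechanics.

```python
import math
import statistics

# (SP avg, AI avg) in percent for the six models listed in the table.
scores = {
    "Gemini-2.5-Pro": (53.2, 63.7),
    "Gemini-2.5-Flash": (42.7, 51.2),
    "GPT-4.1": (48.9, 44.0),
    "Ovis2.5-9B": (35.3, 49.6),
    "InternVL3-14B": (38.9, 45.5),
    "GLM-4.5V": (48.0, 44.4),
}


def paired_t(pairs):
    """Return (mean difference, t statistic, degrees of freedom).

    Differences are taken as second minus first (here AI minus SP).
    """
    diffs = [b - a for a, b in pairs]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std, n - 1
    return mean_diff, mean_diff / std_err, n - 1


mean_diff, t_stat, dof = paired_t(list(scores.values()))
print(f"mean AI - SP = {mean_diff:.2f} pp, t = {t_stat:.2f}, dof = {dof}")
```

The test pairs each model with itself, so between-model differences in overall ability cancel out and only the per-model AI minus SP difference is tested.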

This pattern points to a fundamental weakness: current VLMs are better at retrospective explanation than at prospective prediction of physical dynamics. Humans show the opposite trend, scoring slightly higher on SP (96.2%) than on AI (94.9%), which suggests that predictive physical simulation is a distinctly human cognitive strength.

03

Camera Perspective Matters More Than Quantity

More views don't guarantee better performance — and the egocentric (first-person) perspective dominates in all single-view settings.

A — Richer Visual Input Can Impose a Reasoning Burden

An ablation varying the number of exocentric views from 1 to 4 on BridgeData V2 reveals a counterintuitive finding: advanced reasoning models like Gemini-2.5-Flash consistently degrade as more views are added, while open-weights models like InternVL3-14B improve monotonically. The divergence is strongly tied to task direction. Action Inference benefits from richer visual evidence, whereas State Prediction is fragile under multiple conflicting perspectives, because predicting a future outcome requires precise geometric reasoning that additional views can undermine rather than support.

Figure: Overall Average Accuracy (%) vs. Number of Exocentric Views on BridgeData V2, in three panels: overall accuracy averaged across all six task categories (H/M/L × SP/AI), State Prediction (SP) average, and Action Inference (AI) average.

SP/AI averages computed from H/M/L subtasks per model (same Y-axis scale for direct comparison). The contrasting trend — AI rising for most models while SP stagnates or falls — reveals why additional views help retrospective reasoning but burden predictive reasoning.

B — Egocentric View Dominates in Single-View Scenarios

The egocentric (first-person) view unequivocally outperforms the exocentric (third-person) view across every action level and task direction (SP: p = 0.014; AI: p < 0.001), with the most striking advantage at high-level Action Inference (+23.8 pp). Counterintuitively, fusing both views degrades performance in almost all settings — the combined input introduces conflicting signals rather than synergy, with the worst drop at −6.8 pp for high-level State Prediction.

Task	Action Level	Ego	Exo	Combined	Best Single View	Δ Combined
State Pred.	High-Level	55.4	43.2	48.6	Ego (+12.2)	−6.8
	Mid-Level	35.9	34.0	35.5	Ego (+1.9)	−0.4
	Low-Level	35.5	34.8	33.7	Ego (+0.7)	−1.8
Action Inf.	High-Level	80.3	56.5	77.0	Ego (+23.8)	−3.3
	Mid-Level	37.2	31.6	33.7	Ego (+5.6)	−3.5
	Low-Level	51.6	48.6	51.8	Ego (+3.0)	+0.2

Average accuracy (%) across representative VLMs on the DROID dataset (the only source with both camera perspectives). Best Single View gives the winning view and its margin over the other view in pp. Δ Combined = Combined − best single view, so negative values indicate degradation.
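
The two derived columns can be recomputed from the Ego, Exo, and Combined accuracies. The sketch below does so for the six rows of the table. The function and variable names are illustrative.

```python
# (task, level): (ego, exo, combined) accuracy in percent, from the table.
droid = {
    ("SP", "High"): (55.4, 43.2, 48.6),
    ("SP", "Mid"): (35.9, 34.0, 35.5),
    ("SP", "Low"): (35.5, 34.8, 33.7),
    ("AI", "High"): (80.3, 56.5, 77.0),
    ("AI", "Mid"): (37.2, 31.6, 33.7),
    ("AI", "Low"): (51.6, 48.6, 51.8),
}


def view_deltas(ego, exo, combined):
    """Return (ego advantage over exo, combined minus best single view) in pp."""
    ego_advantage = round(ego - exo, 1)
    delta_combined = round(combined - max(ego, exo), 1)
    return ego_advantage, delta_combined


rows = {key: view_deltas(*vals) for key, vals in droid.items()}
for (task, level), (adv, delta) in rows.items():
    print(f"{task} {level}: ego advantage {adv:+.1f} pp, "
          f"combined vs. best single view {delta:+.1f} pp")
```

Only low-level Action Inference ends up with a non-negative Δ Combined, which is what "almost all settings" refers to.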

04

Error Nature Shifts Across the Action Hierarchy

As actions become more granular, model failures shift from predominantly perceptual errors at the high level to geometric and physical reasoning errors at the low level.

A qualitative analysis of 150 sampled error cases from the top-performing model (Gemini-2.5-Pro) reveals a clear and systematic shift in the nature of failures across the three tiers. All errors fall into two broad categories: Perceptual Grounding Errors (misinterpreting the visual scene) and Reasoning Errors (flawed logic about physical dynamics). The balance between these two categories inverts as we move from high to low-level actions.

Distribution of Error Types by Action Level (Gemini-2.5-Pro, n=50 errors per level)

Action Level	Perceptual Grounding Errors	Reasoning Errors
High	59%	41%
Mid	51%	49%
Low	44%	56%

Error Distribution by Action Level

Action Level	Perceptual Grounding (total)	Neg. Hallucination	Pos. Hallucination	Reasoning Errors (total)	Spatiotemporal	Commonsense
High-Level	59%	31%	28%	41%	19%	22%
Mid-Level	51%	23%	28%	49%	24%	25%
Low-Level	44%	21%	23%	56%	32%	24%
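
The category totals are the sums of their subcategories, and the inversion between levels can be checked directly. The sketch below uses the percentages reported above. The dictionary keys are illustrative names.

```python
# Percent of sampled errors per subcategory, from the breakdown above.
errors = {
    "High": {"neg_halluc": 31, "pos_halluc": 28, "spatiotemporal": 19, "commonsense": 22},
    "Mid": {"neg_halluc": 23, "pos_halluc": 28, "spatiotemporal": 24, "commonsense": 25},
    "Low": {"neg_halluc": 21, "pos_halluc": 23, "spatiotemporal": 32, "commonsense": 24},
}


def split(level):
    """Return (perceptual total, reasoning total) in percent for one level."""
    e = errors[level]
    perceptual = e["neg_halluc"] + e["pos_halluc"]
    reasoning = e["spatiotemporal"] + e["commonsense"]
    return perceptual, reasoning


for level in errors:
    perceptual, reasoning = split(level)
    dominant = "perceptual" if perceptual > reasoning else "reasoning"
    print(f"{level}: perceptual {perceptual}%, reasoning {reasoning}% ({dominant}-dominant)")
```

Most of the shift comes from spatiotemporal reasoning, which rises from 19% to 32% between the high and low levels, while commonsense errors stay between 22% and 25%.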

Complete Error Taxonomy

Errors break down into two main categories and four subcategories, with 11 distinct error types identified from qualitative analysis of 150 sampled failures (Gemini-2.5-Pro, 50 per action level).

Perceptual Grounding Errors

Model's understanding of the physical world is incorrect — a disconnect between what it "sees" and what is actually present. Dominant at High-Level.

Negative Hallucination (misses what IS there)

Static Perception Error: Fails to perceive an object or its properties currently present and static in the environment.

Dynamic Perception Error: Fails to detect or track a change when the gripper interacts with an object, e.g., misidentifying a held object mid-manipulation.

Rotation Misquantification: Perceives that a rotation occurred but misjudges its direction or magnitude around the correct axis.

Translation Misquantification: Perceives translational movement but fails to accurately sense its direction or distance.

Gripper State Misclassification: Fails to determine whether the gripper is open, closed, or holding an object.

Positive Hallucination (invents what ISN'T there)

Action Preconditions Hallucination: Incorrectly believes the conditions for an action are met, e.g., hallucinating that the gripper has grasped an object when it is still open.

Action Outcomes Hallucination: Invents a visual outcome that is not present, e.g., misinterpreting a shadow as a "wet mark" to satisfy a "wipe" command.

Dynamics Hallucination: Fabricates a physical change, e.g., when a hidden chocolate bar is revealed, the model hallucinates a "placing" action.

Rotation / Translation / Gripper Hallucination: Perceives a motion or gripper-state change that never occurred.

Reasoning Errors

Model correctly perceives the visual input but makes a logical mistake in interpreting or acting upon that information. Dominant at Low-Level.

Spatiotemporal Reasoning (space & time failures)

Temporal Reasoning Error: Fails to correctly process the order of sequential states, e.g., believing an apple was placed into a bowl when it was actually picked up.

Spatial Reasoning Error: Makes mistakes in understanding spatial relationships, e.g., misinterpreting a horizontal motion as "lifting into the air."

Reference Frame Grounding Error: Confuses robot-centric and world-centric coordinate systems, e.g., interpreting "move along negative Y" as moving left when the frame says right.
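
The reference-frame confusion can be illustrated with a planar rotation: the same robot-frame command points in opposite world-frame directions depending on how the robot frame is oriented. This is a minimal sketch. The `robot_yaw_deg` parameter and the two example orientations are hypothetical and do not come from the benchmark.

```python
import math


def robot_to_world(vec_xy, robot_yaw_deg):
    """Rotate a 2D direction from the robot frame into the world frame.

    robot_yaw_deg is the robot frame's rotation about the world Z-axis,
    counter-clockwise positive (a hypothetical parameter for illustration).
    """
    yaw = math.radians(robot_yaw_deg)
    x, y = vec_xy
    wx = math.cos(yaw) * x - math.sin(yaw) * y
    wy = math.sin(yaw) * x + math.cos(yaw) * y
    # Round away floating-point noise such as 1e-16.
    return (round(wx, 6), round(wy, 6))


command = (0.0, -1.0)  # "move along negative Y" in the robot frame
print(robot_to_world(command, 0))    # frames aligned
print(robot_to_world(command, 180))  # robot frame turned half a revolution
```

With the frames aligned the command stays along world negative Y. With the robot frame turned by 180 degrees it points along world positive Y, so a model that silently assumes the wrong frame picks the mirrored motion.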

Commonsense Reasoning (world-knowledge failures)

Physics Compliance Error: Plans or predicts an action violating basic physics, e.g., ignoring the coupling that occurs when a gripper closes on a pot.

Postconditions Overinference: Assumes an action caused extra side effects beyond its direct outcome, e.g., insisting a "move" command must end with an object released.

Action Specificity Error: Correctly identifies the objects involved but describes the action too imprecisely; the answer is factually true but not the most meaningful description.

Extraneous Action Error: Selects an outcome image that includes an action not specified by the task, e.g., choosing an image where the gripper opens when no gripper change was requested.

Interactive

Dataset Viewer

Explore ActionEQA examples in real-time. Select a subset and browse the questions — images are loaded directly from HuggingFace.


View Full Dataset on HuggingFace

Reference

Citation

If you use ActionEQA in your research, please cite our paper:

@article{bao2026actioneqa,
  title   = {ActionEQA: Action Interface for Embodied Question Answering},
  author  = {Bao, Tianwei and Wang, Qineng and Wang, Kangrui and Deng, Mingkai
             and Liu, Guangyi and Mao, Jiayuan and Birnbaum, Larry and Hu, Zhiting
             and Xing, Eric P. and Wang, Zhaoran and Li, Manling},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2026},
  url     = {https://openreview.net/forum?id=HY2ruqdMt4},
}