June 25, 2026

SpatialClaw: NVIDIA’s new approach to AI spatial reasoning

NVIDIA Research has introduced SpatialClaw, a new training-free framework that significantly advances how AI agents tackle three-dimensional and dynamic spatial reasoning tasks. Unlike traditional approaches that rely on rigid structured tool calls or one-shot code generation, SpatialClaw allows vision-language model (VLM)-backed agents to use executable Python code as their primary action interface within a persistent, stateful environment. This design enables highly flexible, iterative, and adaptive reasoning about complex visual scenes.

Spatial reasoning – understanding object positions, relationships, depths, movements, and interactions in 3D/4D environments – remains one of the most difficult challenges for modern VLMs. While these models excel at language and basic image interpretation, they frequently falter on precise geometric analysis, multi-step inference, and tasks involving dynamic scenes or multiple viewpoints. Existing agentic methods augment VLMs with perception tools (such as segmenters and depth estimators), but their potential is often constrained by rigid action interfaces that limit how reasoning processes can evolve during execution.

SpatialClaw addresses these limitations by maintaining a persistent Python kernel preloaded with input frames, perception modules, and geometry primitives from libraries like NumPy and SciPy. Instead of selecting from predefined commands or committing to a full program upfront, the agent writes and executes code step by step. It can:

treat perception outputs as ordinary, reusable Python variables;
inspect intermediate results;
revise its strategy based on execution feedback; and
compose sophisticated, task-specific geometric computations that emerge during reasoning.
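To make the list above concrete, here is a minimal sketch of the kind of geometric computation an agent might compose once perception outputs are ordinary variables. It is not taken from the SpatialClaw codebase: the `backproject` helper, the synthetic depth map and masks, and the camera intrinsics are all assumptions chosen for illustration, and a standard pinhole camera model is assumed.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift the masked pixels of a depth map to 3D camera-frame points
    using a pinhole model (hypothetical helper, not a SpatialClaw API)."""
    v, u = np.nonzero(mask)        # row (v) and column (u) indices of the object
    z = depth[v, u]                # depth value at each masked pixel
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Synthetic stand-ins for what a depth estimator and a segmenter might return.
depth = np.full((4, 6), 2.0)
depth[:, 3:] = 5.0                 # right half of the image is farther away
mask_a = np.zeros((4, 6), dtype=bool)
mask_a[1:3, 0:2] = True            # object A, on the left
mask_b = np.zeros((4, 6), dtype=bool)
mask_b[1:3, 4:6] = True            # object B, on the right

# Assumed camera intrinsics (focal lengths and principal point, in pixels).
fx = fy = 2.0
cx, cy = 3.0, 2.0

# Intermediate results stay in scope as plain arrays that later steps can reuse.
points_a = backproject(depth, mask_a, fx, fy, cx, cy)
points_b = backproject(depth, mask_b, fx, fy, cx, cy)
centroid_a = points_a.mean(axis=0)
centroid_b = points_b.mean(axis=0)

distance = float(np.linalg.norm(centroid_a - centroid_b))
closer = "a" if centroid_a[2] < centroid_b[2] else "b"
print(f"centroid distance: {distance:.2f}, closer object: {closer}")
```

Because `points_a`, `centroid_a`, and the rest remain available, a later step could inspect them or build a different measurement on top of them without re-running perception.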

This interactive workflow supports open-ended analysis far beyond what fixed APIs or single-pass scripts allow. The system includes safety mechanisms and operates in a multi-turn loop of planning, execution, and observation.
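The multi-turn loop can be pictured with a small sketch like the one below. It is only an illustration under stated assumptions: the `PersistentSession` class is hypothetical, the hard-coded `steps` stand in for code a VLM would write, and plain `exec` provides none of the safety mechanisms the real system is said to include.

```python
import contextlib
import io
import traceback

class PersistentSession:
    """Hypothetical stand-in for a stateful kernel: one namespace shared by all steps."""

    def __init__(self, preloaded=None):
        # Variables preloaded here (frames, tools) would be visible to every step.
        self.namespace = dict(preloaded or {})

    def run(self, code):
        """Execute one step and return an observation for the agent to read."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                # NOTE: exec is not a sandbox; this sketch omits any safety layer.
                exec(code, self.namespace)
            return {"ok": True, "output": buffer.getvalue()}
        except Exception:
            # Return the error as feedback instead of aborting the session.
            return {"ok": False, "output": buffer.getvalue() + traceback.format_exc(limit=1)}

session = PersistentSession()

# Scripted steps standing in for code the agent would generate turn by turn.
steps = [
    "centers = {'chair': (1.0, 0.0, 3.0), 'lamp': (1.0, 0.0, 5.0)}",
    "print(gap)",  # fails: 'gap' has not been defined yet
    "gap = centers['lamp'][2] - centers['chair'][2]\nprint(gap)",  # revised step
]

observations = [session.run(step) for step in steps]
for step_number, observation in enumerate(observations, start=1):
    print(step_number, observation["ok"], observation["output"].strip())
```

The second step fails and the third recovers, reusing `centers` from the first step. That is the point of keeping state between turns: an error becomes feedback rather than a reason to start over.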

On a comprehensive suite of 20 spatial reasoning benchmarks spanning static single-image, multi-view, general spatial, video, and 4D dynamic tasks, SpatialClaw achieved an average accuracy of 59.9%. This represents an 11.2 percentage point improvement over a recent state-of-the-art spatial agent (SpaceTools-Toolshed) using the same Gemma 4-31B backbone. Gains were consistent across six different VLM backbones (from the Qwen and Gemma families, ranging from 26B to 397B parameters) with no benchmark-specific tuning or additional training.
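A quick back-of-the-envelope check helps separate percentage points from relative improvement. The snippet below uses only the two figures quoted above and assumes both are averages over the same 20 benchmarks; the baseline and relative gain it derives are implied values, not numbers reported in the article.

```python
spatialclaw_avg = 59.9   # average accuracy (%), as quoted above
gain_points = 11.2       # improvement in percentage points, as quoted above

# Implied baseline average, assuming both figures cover the same benchmark suite.
baseline_avg = round(spatialclaw_avg - gain_points, 1)

# Relative improvement over that implied baseline, in percent.
relative_gain = round(100 * gain_points / baseline_avg, 1)

print(f"implied baseline: {baseline_avg}%, relative gain: {relative_gain}%")
```

Under that assumption, the gap of 11.2 points corresponds to a relative improvement of roughly 23% over an implied baseline of about 48.7%.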

One of the study’s key findings is that performance gains stem primarily from the action interface itself rather than from specialized perception tools. Experiments showed that even when utility wrappers were removed, the framework maintained strong performance. Researchers found that the ability to compose, inspect, and revise reasoning steps through code contributed significantly to SpatialClaw’s effectiveness.

The framework’s architecture also highlights a broader shift in AI agent design. Instead of focusing solely on expanding an agent’s toolkit, SpatialClaw emphasizes creating a more expressive workspace where reasoning can unfold dynamically. This allows agents to adapt to complex spatial tasks that require multiple stages of analysis and decision-making.

SpatialClaw arrives amid growing industry interest in agentic AI and physical AI systems capable of understanding and interacting with the real world. As AI applications increasingly move into robotics, autonomous systems, simulation environments, and embodied intelligence, robust spatial reasoning is becoming a critical capability. NVIDIA’s latest research suggests that giving AI agents the freedom to reason through code may be a promising path toward more capable and adaptable spatial intelligence.

The full project, including code, detailed reasoning trajectories, presentation, and the research paper, is available on the SpatialClaw webpage and GitHub.