VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

¹University of Central Florida   ²University of Würzburg   ³University of Southern California   ⁴NVIDIA Research
VLA-Thinker Thinking-with-Image Reasoning Framework

VLA-Thinker introduces a thinking-with-image reasoning paradigm for Vision-Language-Action models. Instead of treating visual observations as static context, the model dynamically queries task-relevant visual regions during reasoning, enabling an interleaved perception–reasoning–action process that improves performance on long-horizon robotic manipulation tasks. A minimal sketch of this loop is given below.
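The following is a minimal sketch of the interleaved loop described above, not the released VLA-Thinker API. The `<zoom_in>` tag format, the `vlm`, `action_head`, and `camera` objects, and the `embed` call are hypothetical placeholders used only to illustrate how perception can be invoked as a reasoning action.

```python
import re

def run_episode(vlm, action_head, instruction, camera, max_steps=8):
    """Alternate reasoning tokens with on-demand visual queries before acting."""
    image = camera.capture()
    context = [instruction, image]
    for _ in range(max_steps):
        thought = vlm.generate(context)  # free-form reasoning; may request a tool
        match = re.search(r"<zoom_in>(\d+),(\d+),(\d+),(\d+)</zoom_in>", thought)
        if match:
            # Perception as a reasoning action: crop the queried region and
            # append it so the next reasoning step sees the refined view.
            x1, y1, x2, y2 = map(int, match.groups())
            context += [thought, image.crop((x1, y1, x2, y2))]
            continue
        # No further visual query: decode the low-level action chunk.
        return thought, action_head.decode(vlm.embed(context + [thought]))
    return thought, None
```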

Abstract

Vision–Language–Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. Instead of relying on a single visual encoding, the model actively queries task-relevant visual regions through tool invocation during reasoning, enabling an interleaved perception–reasoning–action process. To train such a system, we introduce a two-stage training pipeline consisting of (1) a supervised fine-tuning (SFT) cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning to align complete reasoning–action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks.
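To make the SFT cold-start stage concrete, the snippet below shows one possible layout of a curated visual Chain-of-Thought sample: reasoning text interleaved with a zoom-in tool call, its returned crop, and the target action chunk. The field names, file names, and tag vocabulary are assumptions for illustration, not the paper's released data schema.

```python
# Illustrative SFT cold-start sample (format assumed, not the official schema).
sft_sample = {
    "instruction": "put the red mug on the upper shelf",
    "observation": "front_camera_0421.png",
    "trajectory": [
        {"type": "think", "text": "I need to locate the red mug before grasping."},
        {"type": "tool_call", "name": "zoom_in", "bbox": [312, 188, 452, 300]},
        {"type": "tool_result", "image": "front_camera_0421_crop.png"},
        {"type": "think", "text": "The mug handle faces right; approach from the left."},
        {"type": "action", "chunk": [[0.42, -0.11, 0.23, 0.0, 0.0, 1.57, 1.0]]},
    ],
}
```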

Method

Architectural Overview

The upper panel illustrates the overall pipeline of our proposed Thinking-with-Image framework. Language instructions and visual observations are encoded by a shared VLM, which interleaves reasoning with dynamic zoom-in perception before generating actions. The lower panel presents the two-stage training strategy: (1) an SFT cold-start to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align multimodal reasoning–action trajectories with task-level objectives under sparse rewards.
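For the second stage, a minimal sketch of a GRPO-style objective under a sparse task-success reward is shown below, assuming G complete reasoning–action rollouts are sampled per task. The function and variable names are illustrative, not the authors' implementation, and the usual KL regularization toward the reference policy is omitted for brevity.

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, clip_eps=0.2):
    """Group-relative policy optimization over G rollouts of the same task.

    logprobs, old_logprobs: (G, T) token log-probs of each sampled trajectory
    rewards:                (G,)  sparse task-level reward (e.g. 1.0 on success)
    """
    # Group-relative advantage: normalize each rollout's reward by the group
    # statistics instead of training a separate value function.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)        # (G,)
    ratio = torch.exp(logprobs - old_logprobs)                       # (G, T)
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv[:, None]
    return -torch.min(unclipped, clipped).mean()
```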

Performance

LIBERO Benchmark Results

LIBERO Benchmark. VLA-Thinker achieves a 97.5% success rate, outperforming the OpenVLA-OFT baseline by 6.5%.

RoboTwin Benchmark Results

RoboTwin 2.0 Benchmark. VLA-Thinker improves performance across short-, medium-, and long-horizon robotic manipulation tasks.