NEW VISUAL CoT Reasoning
Abstract
A new study by NVIDIA, Stanford University, and MIT uncovers a method for visual chain-of-thought reasoning over complex manipulation tasks. The authors transpose the "linguistic CoT" to a "visual CoT" by teaching an AI system to generate sub-goal images for vision-language-action (VLA) models, i.e., robotic AI models.
Vision-language-action models (VLAs) have shown potential in leveraging pre-trained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capability. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames auto-regressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. We demonstrate that CoT-VLA achieves strong performance in manipulation tasks on both real-world and simulation benchmarks.
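To make the two-stage decoding concrete, here is a minimal, hypothetical sketch of a visual chain-of-thought rollout: a unified autoregressive transformer first emits the discrete tokens of a sub-goal image (the "visual CoT"), then emits a short action chunk conditioned on that sub-goal. This is not the authors' code; the tiny backbone, the shared vocabulary split, and the constants `SUBGOAL_TOKENS` and `ACTION_CHUNK` are all illustrative assumptions standing in for the 7B model and its tokenizers.

```python
# Hypothetical sketch of visual chain-of-thought decoding (not the paper's code).
# Assumption: one autoregressive transformer over a shared vocabulary in which
# ids [0, VISUAL_VOCAB) are image tokens (e.g. from a VQ codebook) and ids
# [VISUAL_VOCAB, VOCAB) are discretized action tokens.
import torch
import torch.nn as nn

VISUAL_VOCAB = 8192      # assumed visual codebook size
ACTION_VOCAB = 256       # assumed number of discretized action bins
VOCAB = VISUAL_VOCAB + ACTION_VOCAB
SUBGOAL_TOKENS = 256     # tokens in one predicted sub-goal frame (assumed)
ACTION_CHUNK = 7         # length of the short action sequence (assumed)

class TinyVLA(nn.Module):
    """Toy stand-in for a VLA backbone: embeds token ids, runs a causal
    transformer, and projects back to the shared vocabulary."""
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                          # tokens: (B, T) ids
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.backbone(x, mask=mask))   # (B, T, VOCAB)

@torch.no_grad()
def visual_cot_rollout(model, prompt):
    """Phase 1: autoregressively predict a sub-goal image (visual CoT).
    Phase 2: predict a short action chunk conditioned on that sub-goal."""
    seq = prompt                                         # (B, T0) obs + instruction ids
    for _ in range(SUBGOAL_TOKENS):                      # emit visual goal tokens
        logits = model(seq)[:, -1, :VISUAL_VOCAB]        # restrict to image vocab
        seq = torch.cat([seq, logits.argmax(-1, keepdim=True)], dim=1)
    actions = []
    for _ in range(ACTION_CHUNK):                        # emit action tokens
        logits = model(seq)[:, -1, VISUAL_VOCAB:]        # restrict to action vocab
        nxt = logits.argmax(-1, keepdim=True) + VISUAL_VOCAB
        seq = torch.cat([seq, nxt], dim=1)
        actions.append(nxt - VISUAL_VOCAB)               # store raw action bins
    return torch.cat(actions, dim=1)                     # (B, ACTION_CHUNK)

model = TinyVLA()
prompt = torch.randint(0, VISUAL_VOCAB, (1, 32))         # toy observation tokens
print(visual_cot_rollout(model, prompt).shape)           # torch.Size([1, 7])
```

Restricting the logits to the image slice in phase one and the action slice in phase two is one simple way to enforce the ordering the abstract describes: the model must commit to a visual goal before any action is produced.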
All rights remain with the authors:
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao 1,2, Yao Lu 1, Moo Jin Kim 2, Zipeng Fu 2, Zhuoyang Zhang 3, Yecheng Wu 1,3, Zhaoshuo Li 1, Qianli Ma 1, Song Han 1,3, Chelsea Finn 2, Ankur Handa 1, Ming-Yu Liu 1, Donglai Xiang 1, Gordon Wetzstein 2, Tsung-Yi Lin 1
Affiliations:
1 NVIDIA
2 Stanford University
3 MIT