Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm (Nov 2025)
- Link: http://arxiv.org/abs/2511.04570v1
- Date: November 2025
Abstract
This paper proposes “Thinking with Video” as a new multimodal reasoning paradigm that leverages video generation models such as Sora-2 to overcome the limitations of static text- and image-based reasoning. It introduces VideoThinkBench, a benchmark comprising vision-centric tasks (e.g., eyeballing puzzles, mazes) and text-centric tasks (e.g., MATH, MMMU). Evaluation shows that Sora-2 is a strong reasoner, often comparable to or surpassing state-of-the-art VLMs, and that its performance improves with self-consistency and in-context learning. These results position video generation models as a promising path toward unified multimodal understanding and generation.
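The self-consistency technique mentioned above is, in its standard form, majority voting over several independently sampled answers. The sketch below is a minimal illustration of that idea, not the paper's pipeline; the `sample_answers` helper and the notion of extracting an answer per generated video are assumptions for the example.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over independently sampled final answers.

    Ties are broken by first occurrence, per Counter.most_common.
    """
    return Counter(answers).most_common(1)[0][0]

def sample_answers(generate, prompt, n=8):
    """Hypothetical sampler: each call to `generate` stands in for one
    model run (e.g., one video generation plus answer extraction)."""
    return [generate(prompt) for _ in range(n)]

# Toy demo with stubbed samples standing in for model outputs.
samples = ["4", "4", "5", "4", "4", "3", "4", "4"]
print(self_consistency(samples))  # -> 4
```

The key design point is that voting only helps when samples are drawn independently (e.g., with nonzero temperature), so that errors are uncorrelated across runs.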
Key Topics:
- Thinking with Video
- Video Generation
- Multimodal Reasoning
- Sora-2
- VideoThinkBench
- Self-Consistency
- In-Context Learning
- Spatial Reasoning