Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm (Nov 2025)
- Link: http://arxiv.org/abs/2511.04570v1
- Date: November 2025
Abstract
This paper proposes “Thinking with Video” as a new multimodal reasoning paradigm that leverages video generation models such as Sora-2 to overcome the limitations of static text- and image-based reasoning. It introduces VideoThinkBench, a benchmark comprising vision-centric tasks (e.g., eyeballing puzzles, mazes) and text-centric tasks (e.g., MATH, MMMU). Evaluation shows that Sora-2 is a strong reasoner, often comparable to or surpassing state-of-the-art VLMs, and that its performance improves with self-consistency and in-context learning. These results position video generation models as a promising path toward unified multimodal understanding and generation.
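The self-consistency technique mentioned above is, in its standard form, majority voting over several independently sampled answers. The sketch below is a minimal illustration of that idea, not the paper's pipeline; the `sample_answers` helper and the notion of extracting an answer per generated video are assumptions for the example.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over independently sampled final answers.

    Ties are broken by first occurrence, per Counter.most_common.
    """
    return Counter(answers).most_common(1)[0][0]

def sample_answers(generate, prompt, n=8):
    """Hypothetical sampler: each call to `generate` stands in for one
    model run (e.g., one video generation plus answer extraction)."""
    return [generate(prompt) for _ in range(n)]

# Toy demo with stubbed samples standing in for model outputs.
samples = ["4", "4", "5", "4", "4", "3", "4", "4"]
print(self_consistency(samples))  # -> 4
```

The key design point is that voting only helps when samples are drawn independently (e.g., with nonzero temperature), so that errors are uncorrelated across runs.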
Key Topics:
- Thinking with Video
- Video Generation
- Multimodal Reasoning
- Sora-2
- VideoThinkBench
- Self-Consistency
- In-Context Learning
- Spatial Reasoning