Stop Thinking, Just Do!

Sungsoo Kim's Blog

Video Generation as a Promising Multimodal Reasoning Paradigm


14 November 2025


Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm (Nov 2025)

Abstract

This paper proposes “Thinking with Video,” a multimodal reasoning paradigm that uses video generation models such as Sora-2 to address the limitations of reasoning over static text and images. It introduces VideoThinkBench, a benchmark with vision-centric tasks (e.g., eyeballing puzzles, mazes) and text-centric tasks (e.g., MATH, MMMU). In the paper's evaluation, Sora-2 performs as a capable reasoner, often comparable to or surpassing state-of-the-art (SOTA) vision-language models (VLMs), and its performance improves with self-consistency and in-context learning. The authors argue that this positions video generation models as potential unified models for multimodal understanding and generation.

Key Topics:

  • Thinking with Video
  • Video Generation
  • Multimodal Reasoning
  • Sora-2
  • VideoThinkBench
  • Self-Consistency
  • In-Context Learning
  • Spatial Reasoning
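
Self-consistency, mentioned in the abstract, generally means sampling several independent outputs for the same prompt and keeping the most frequent final answer. The snippet below is a minimal sketch of that majority vote, not the paper's implementation: it assumes the final answer has already been extracted from each generated video as a string, and the function name `self_consistency_vote` and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Majority vote over sampled answers (illustrative sketch only).

    `answers` is assumed to hold one extracted final answer per sampled
    generation; empty strings or None mark samples with no usable answer.
    Returns (winning_answer, vote_share), or (None, 0.0) if nothing is usable.
    """
    # Drop samples where no answer could be extracted.
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None, 0.0

    # Count identical answers and pick the most frequent one.
    counts = Counter(valid)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(valid)


# Hypothetical answers extracted from six sampled generations.
samples = ["B", "A", "B", "", "B", "C"]
answer, share = self_consistency_vote(samples)
print(answer, share)
```

The vote share gives a rough confidence signal: a low share suggests the samples disagree and the aggregated answer is less reliable.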