VGGT: Visual Geometry Grounded Transformer (Mar 2025)
Abstract
VGGT is a large feed-forward transformer that directly infers key 3D scene attributes, including camera parameters, depth maps, point maps, and 3D point tracks, from one or many input views in seconds. It bypasses traditional iterative optimization techniques such as Bundle Adjustment while achieving state-of-the-art performance in 3D reconstruction, and it serves as a versatile backbone for downstream tasks such as video tracking and novel view synthesis.
Key Topics:
- 3D Reconstruction
- Visual Geometry
- Transformers
- Structure from Motion
- Camera Pose Estimation
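
The attributes named in the abstract are geometrically related: given a depth map and camera parameters, a point map follows by unprojection. The sketch below illustrates that relationship only. It is not VGGT's code or API. It assumes a standard pinhole camera with intrinsics `K` and a 4x4 camera-to-world matrix, and the function name `unproject_depth` is hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to a world-space point map (H, W, 3).

    Assumes a pinhole camera: K is the 3x3 intrinsics matrix and
    cam_to_world is a 4x4 rigid transform (illustrative convention).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel to a camera-space ray with z = 1,
    # then scale the ray by the per-pixel depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]
    # Move the points from the camera frame into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

# Toy usage: a 3x3 image at constant depth 2, camera shifted 1 unit along z.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[2, 3] = 1.0
point_map = unproject_depth(np.full((3, 3), 2.0), K, cam_to_world)
```

A feed-forward model that predicts depth, cameras, and point maps jointly produces all of these quantities directly, whereas a classical pipeline would typically recover them through an iterative optimization step.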