SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes

Jungho Lee   Minhyeok Lee   Sunghun Yang   Minseok Kang   Sangyoun Lee
Yonsei University



Abstract


3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method to align neighboring chunks using a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the need for the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.

Methods

Architecture

SwiftVGGT processes thousands of input images by dividing them into sliding-window chunks, each of which is processed by VGGT. To reduce inference time, we eliminate the IRLS optimization step by applying reliability-guided point sampling and aligning neighboring chunks with a single Sim(3)-based SVD step. Furthermore, we directly use the patch tokens from VGGT's DINO transformer for loop detection, which further decreases the overall inference cost.
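
The chunk alignment step can be illustrated with a minimal sketch of a closed-form Sim(3) estimate (Umeyama-style) computed by one SVD over sampled point correspondences. This is not the released implementation; the function name, the weighting scheme, and the assumption that reliability scores are available as per-point confidences are illustrative.

```python
# Minimal sketch (assumption, not the authors' code): estimate a Sim(3) transform
# between corresponding 3D points of two neighboring chunks with a single SVD,
# instead of iterating IRLS. Reliability scores down-weight unreliable points.
import numpy as np

def sim3_from_correspondences(src, dst, weights=None):
    """Estimate scale s, rotation R, translation t with dst ~ s * R @ src + t.

    src, dst: (N, 3) corresponding 3D points from the overlapping frames of
              two adjacent chunks.
    weights:  optional (N,) reliability scores (e.g. predicted confidence).
    """
    if weights is None:
        weights = np.ones(len(src))
    w = weights / weights.sum()

    # Weighted centroids and centered point sets.
    mu_src = (w[:, None] * src).sum(axis=0)
    mu_dst = (w[:, None] * dst).sum(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Weighted cross-covariance and its SVD.
    cov = (w[:, None] * dst_c).T @ src_c
    U, S, Vt = np.linalg.svd(cov)

    # Keep R a proper rotation (det = +1) even for degenerate configurations.
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    # Closed-form scale and translation (Umeyama, 1991).
    var_src = (w * (src_c ** 2).sum(axis=1)).sum()
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Because the sampled correspondences are already filtered by reliability, a single closed-form solve of this kind can replace the iterative reweighting that IRLS performs.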

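Loop detection can likewise be sketched without an external VPR model by pooling the patch tokens that VGGT already computes into per-frame descriptors and comparing them with cosine similarity. The pooling choice, the threshold `tau`, and the `min_gap` parameter below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumption): loop-candidate search over pooled DINO patch tokens.
import torch
import torch.nn.functional as F

def frame_descriptor(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Pool a frame's patch tokens (P, C) into one L2-normalized descriptor (C,)."""
    return F.normalize(patch_tokens.mean(dim=0), dim=0)

def detect_loops(descriptors: torch.Tensor, tau: float = 0.85, min_gap: int = 50):
    """Return candidate loop pairs (i, j) whose descriptors are highly similar.

    descriptors: (N, C) stacked, normalized frame descriptors from all chunks.
    min_gap:     skip temporally adjacent frames, which are trivially similar.
    """
    sim = descriptors @ descriptors.T  # cosine similarity of normalized vectors
    pairs = []
    n = sim.shape[0]
    for i in range(n):
        for j in range(i + min_gap, n):
            if sim[i, j] > tau:
                pairs.append((i, j))
    return pairs
```

Since the descriptors are by-products of the VGGT forward pass, this search adds only a similarity computation rather than a second network inference.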
Quantitative Results


Visualization - KITTI

Qualitative comparison of reconstructions on the KITTI dataset: DROID-SLAM, VGGT-Long, and SwiftVGGT (figure panels).