CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
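The abstract does not include implementation details, but the two guidance mechanisms it describes (contextual guidance from text embeddings and spatial guidance from 3D point features, both injected via cross-attention) can be illustrated with a minimal sketch. The module below is an assumption-laden illustration, not the authors' code: the class name, feature dimensions, and the use of `nn.MultiheadAttention` are hypothetical choices standing in for whatever transformer blocks CATSplat actually uses.

```python
import torch
import torch.nn as nn


class ContextSpatialCrossAttention(nn.Module):
    """Illustrative sketch (not the authors' implementation): single-view image
    tokens cross-attend to text embeddings from a vision-language model
    (contextual guidance) and to 3D point features (spatial guidance)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.point_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_point = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_emb, point_feats):
        # img_tokens:  (B, H*W, C) per-pixel image features from the backbone
        # text_emb:    (B, T, C)   scene-specific text embeddings (e.g., from a VLM)
        # point_feats: (B, N, C)   features of 3D points serving as a geometric prior
        x = img_tokens

        # Contextual guidance: image tokens query the text embeddings.
        ctx, _ = self.text_attn(query=self.norm_text(x), key=text_emb, value=text_emb)
        x = x + ctx

        # Spatial guidance: image tokens query the 3D point features.
        spa, _ = self.point_attn(query=self.norm_point(x), key=point_feats, value=point_feats)
        x = x + spa

        # Enriched features would then be decoded into per-pixel 3D Gaussian parameters.
        return x
```

In this reading, the enriched per-pixel features are subsequently mapped to Gaussian attributes (position offsets, opacity, covariance, color) by a separate prediction head; the exact decoding scheme and how the 3D point features are obtained are specified in the paper, not here.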