We introduce NoPoSplat, a feed-forward model that reconstructs 3D scenes parameterized by 3D Gaussians from unposed sparse multi-view images. Trained exclusively with a photometric loss, our model achieves real-time 3D Gaussian reconstruction at inference. To eliminate the need for accurate pose input during reconstruction, we anchor one input view's local camera coordinates as the canonical space and train the network to predict Gaussian primitives for all views within this space. This approach obviates transforming Gaussian primitives from local coordinates into a global coordinate system, thereby avoiding the errors that per-frame Gaussian transformation and pose estimation would introduce. To resolve scale ambiguity, we design and compare several intrinsic embedding methods, ultimately converting the camera intrinsics into a token embedding that is concatenated with the image tokens as model input, which enables accurate prediction of scene scale. We apply the reconstructed 3D Gaussians to novel view synthesis and pose estimation, and propose a two-stage coarse-to-fine pipeline for accurate pose estimation. Experiments show that our pose-free approach achieves superior novel view synthesis quality compared to pose-required methods, particularly in scenarios with limited overlap between input images. For pose estimation, our method, trained without ground-truth depth or an explicit matching loss, outperforms state-of-the-art methods by a substantial margin. This work advances pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
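To make the intrinsic token embedding concrete, below is a minimal sketch in PyTorch, not the authors' implementation: the intrinsics are reduced to four normalized entries (fx, fy, cx, cy), linearly projected to a single token, and concatenated with the image patch tokens. The module name, the choice of four normalized entries, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of embedding camera
# intrinsics as a token and concatenating it with ViT-style image tokens.
import torch
import torch.nn as nn


class IntrinsicTokenEmbedding(nn.Module):
    """Maps a 3x3 intrinsic matrix K to a single token of dimension d_model."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # K carries four informative entries once normalized: fx, fy, cx, cy.
        self.proj = nn.Linear(4, d_model)

    def forward(self, K: torch.Tensor, image_size: tuple[int, int]) -> torch.Tensor:
        h, w = image_size
        # Normalize by image size so the token is resolution-independent
        # (an assumption; the paper compares several embedding variants).
        fx, fy = K[..., 0, 0] / w, K[..., 1, 1] / h
        cx, cy = K[..., 0, 2] / w, K[..., 1, 2] / h
        feats = torch.stack([fx, fy, cx, cy], dim=-1)  # (B, 4)
        return self.proj(feats).unsqueeze(1)           # (B, 1, d_model)


# Usage: prepend the intrinsic token to the patch tokens of each input view.
B, N, d_model = 2, 196, 768
patch_tokens = torch.randn(B, N, d_model)   # tokens from the image encoder
K = torch.eye(3).expand(B, 3, 3).clone()
K[:, 0, 0], K[:, 1, 1] = 500.0, 500.0       # fx, fy in pixels
K[:, 0, 2], K[:, 1, 2] = 128.0, 128.0       # cx, cy in pixels

embed = IntrinsicTokenEmbedding(d_model)
tokens = torch.cat([embed(K, (256, 256)), patch_tokens], dim=1)  # (B, N+1, d_model)
```

Injecting the intrinsics as an extra token leaves the backbone unchanged while still giving the network the focal-length information it needs to resolve the scene-scale ambiguity described above.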