Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques for reconstructing and rendering photo-realistic images. However, the prerequisite of running Structure-from-Motion (SfM) to obtain camera poses limits their completeness. While previous methods can reconstruct from a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose ZeroGS to train 3DGS from hundreds of unposed and unordered images. Our method leverages a pretrained foundation model as the neural scene representation. Since the predicted pointmaps are not accurate enough for reliable image registration and high-fidelity image rendering, we mitigate this issue by initializing and finetuning the pretrained model from a seed image. Images are then progressively registered and added to the training buffer, which in turn is used to train the model. We also refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. Experiments on the LLFF, MipNeRF360, and Tanks-and-Temples datasets show that our method recovers more accurate camera poses than state-of-the-art pose-free NeRF/3DGS methods, and even renders higher-quality images than 3DGS with COLMAP poses. Our project page is available at this https URL.
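The abstract does not spell out the ray consistency term. As a rough illustration only, one natural reading of a point-to-camera ray consistency loss is the perpendicular distance between a 3D point predicted from one view's pointmap and the camera ray cast through its corresponding pixel in another view. The PyTorch sketch below follows that assumption; the function name, tensor layout, and the upstream correspondence and pose machinery are hypothetical and not taken from the paper.

```python
import torch

def point_to_ray_loss(points_w, cam_centers, ray_dirs):
    """Hypothetical point-to-camera ray consistency term (not the paper's code).

    points_w:    (N, 3) world-space points from one view's predicted pointmap
    cam_centers: (N, 3) camera center of the matching view for each point
    ray_dirs:    (N, 3) unit ray directions through the matched pixels
    Returns the mean perpendicular distance from each point to its matched ray.
    """
    v = points_w - cam_centers                        # vector from camera to point
    proj = (v * ray_dirs).sum(dim=-1, keepdim=True)   # scalar projection onto the ray
    closest = cam_centers + proj * ray_dirs           # closest point on the ray
    return (points_w - closest).norm(dim=-1).mean()   # perpendicular distance
```

In a multi-view setting such a term would presumably be accumulated over all matched pixel pairs across registered views and minimized jointly with respect to the camera poses and the pointmaps.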