Pre-training on large-scale unlabeled datasets helps models achieve strong performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training because of the inherent cost of volume rendering. In this paper, we propose GS3, an efficient framework for learning point cloud representations that seamlessly integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, since only Gaussian points that carry rich learned geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. We then predict the 3D Gaussian points of the scene from the learned point cloud features and use a tile-based rasterizer to render images. Finally, the pre-trained point cloud encoder can be fine-tuned for various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, GS3 is highly efficient, achieving approximately 9× pre-training speedup and requiring less than 0.25× the memory of the previous rendering-based framework Ponder.
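The back-projection step mentioned above can be illustrated with a minimal sketch. It assumes a standard pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and treats non-positive depth as missing; the abstract does not specify these details, so the function name `back_project` and its parameters are hypothetical and are not taken from the GS3 implementation. A practical pipeline would likely vectorize this and also apply the camera pose to move points into a shared world frame.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
ColoredPoint = Tuple[float, float, float, Color]


def back_project(
    depth: Sequence[Sequence[float]],
    rgb: Sequence[Sequence[Color]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[ColoredPoint]:
    """Lift every pixel with valid depth to a colored 3D point in camera space.

    Assumes a pinhole camera: a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[ColoredPoint] = []
    for v, row in enumerate(depth):          # v indexes image rows
        for u, z in enumerate(row):          # u indexes image columns
            if z <= 0:                       # assumption: non-positive depth means "no measurement"
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, rgb[v][u]))
    return points


# Tiny 2x2 RGB-D frame with one missing depth value (hypothetical data).
depth_map = [[2.0, 0.0],
             [2.0, 2.0]]
color_map = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (255, 255, 255)]]
cloud = back_project(depth_map, color_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

The resulting colored points are the kind of input a point cloud encoder would consume to produce point-wise features; the encoder, the Gaussian prediction, and the rasterizer are outside the scope of this sketch.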