2412.13193.md

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

3D Semantic Occupancy Prediction is fundamental for spatial understanding as it provides a comprehensive semantic cognition of surrounding environments. However, prevalent approaches primarily rely on extensive labeled data and computationally intensive voxel-based modeling, restricting the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. Through aligning rendered Gaussian features with diverse knowledge from pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset showcase GaussTR's state-of-the-art zero-shot performance, achieving 11.70 mIoU while reducing training duration by approximately 50%. These experimental results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents.
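The headline figure above is reported in mIoU (mean intersection-over-union). As a minimal sketch of how that metric is commonly computed over flattened voxel labels, the snippet below uses a hypothetical helper `mean_iou` with an illustrative `ignore_index` parameter; neither name comes from the paper, and the exact evaluation protocol used on Occ3D-nuScenes (class list, visibility masks, ignored labels) may differ.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class ids, one per voxel.
    ignore_index: optional label excluded from scoring (an assumption here,
    not taken from the paper).
    """
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        inter = 0
        union = 0
        for p, t in zip(pred, target):
            if t == ignore_index:
                continue  # skip voxels without a valid ground-truth label
            p_is_c = p == c
            t_is_c = t == c
            inter += p_is_c and t_is_c
            union += p_is_c or t_is_c
        if union > 0:  # classes absent from both pred and target are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.2f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, which average to 0.5. Benchmark results such as the 11.70 above are usually quoted on a 0 to 100 scale.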
