3D Vision-Language Gaussian Splatting

Recent advances in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches naively embed semantic representations into 3D reconstruction methods without balancing the visual and language modalities. This leads to unsatisfactory semantic rasterization of translucent or reflective objects, as well as over-fitting to the color modality. To alleviate these limitations, we propose a 3D vision-language Gaussian splatting model for scene understanding that handles the distinct visual and semantic modalities adequately and places emphasis on representation learning for the language modality. We propose a novel cross-modal rasterizer that uses modality fusion together with a smoothed semantic indicator to enhance semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.
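To illustrate why translucent objects are hard when semantics reuse the color opacity, the following is a minimal sketch of front-to-back alpha compositing for a single pixel. It is not the paper's cross-modal rasterizer: the abstract does not specify its exact form, and modality fusion and the smoothing step are omitted. The function name `composite_pixel` and the per-Gaussian `indicators` argument are hypothetical, introduced only to show semantic features being blended with a weight that is separate from the color opacity.

```python
from typing import List, Sequence, Tuple


def composite_pixel(
    colors: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
    opacities: Sequence[float],
    indicators: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Composite depth-sorted Gaussians (nearest first) for one pixel.

    Color is blended with the usual opacity. Semantic features are blended
    with a separate, hypothetical per-Gaussian indicator in [0, 1].
    """
    color_out = [0.0] * len(colors[0])
    feat_out = [0.0] * len(features[0])
    t_color = 1.0  # transmittance left for the color modality
    t_sem = 1.0    # transmittance left for the semantic modality
    for c, f, a, s in zip(colors, features, opacities, indicators):
        w_color = t_color * a
        w_sem = t_sem * s
        color_out = [o + w_color * x for o, x in zip(color_out, c)]
        feat_out = [o + w_sem * x for o, x in zip(feat_out, f)]
        t_color *= 1.0 - a
        t_sem *= 1.0 - s
    return color_out, feat_out


# Assumed toy scene: a nearly transparent glass pane in front of an opaque wall.
color, feat = composite_pixel(
    colors=[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
    features=[[1.0, 0.0], [0.0, 1.0]],  # feature axis 0 = "glass", axis 1 = "wall"
    opacities=[0.1, 1.0],                # glass contributes little to color
    indicators=[0.9, 1.0],               # but should dominate the semantics
)
```

In this toy scene the rendered color is dominated by the wall, while the rendered feature is dominated by the glass. If the semantic channel reused `opacities`, the glass would contribute only a small share of the feature at that pixel.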

