Recent advances in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches naively embed semantic representations into 3D reconstruction methods without striking a balance between the visual and language modalities, which leads to unsatisfactory semantic rasterization of translucent or reflective objects, as well as over-fitting to the color modality. To alleviate these limitations, we propose a 3D vision-language Gaussian splatting model for scene understanding that adequately handles the distinct visual and semantic modalities and emphasizes representation learning of the language modality. We propose a novel cross-modal rasterizer that uses modality fusion along with a smoothed semantic indicator to enhance semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.