2412.00734.md

ChatSplat: 3D Conversational Gaussian Splatting

Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has attracted growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field to enable rich chat-based interaction within 3D space. Unlike existing methods, which primarily use CLIP-derived language features and focus solely on segmentation, ChatSplat supports interaction at three levels: objects, views, and the entire 3D scene. For view-level interaction, we design an encoder that converts the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, in contrast to LangSplat's pixel-wise language embedding, which implicitly entangles the mask and the embedding. Here, we explicitly decouple the language embedding into separate mask and feature-map representations, allowing more flexible object-level interaction. The language embeddings used in the LLM follow a complex and diverse distribution, which makes them difficult for 3D Gaussians to learn; to address this, we introduce a learnable normalization technique that standardizes these embeddings and facilitates effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions (object, view, and scene) within 3D space, enhancing both understanding and engagement.
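The abstract does not spell out how the learnable normalization is formulated, so the following is only a minimal sketch of one plausible reading: a per-channel shift and scale that map widely varying embedding channels to a standardized range before they are used as regression targets, with an inverse map back to the original space. The class name `LearnableNorm`, its methods, and the synthetic data are all hypothetical and are not taken from the paper; in a real training loop the `shift` and `log_scale` parameters would be updated by gradient descent rather than fixed from data statistics.

```python
import numpy as np


class LearnableNorm:
    """Hypothetical per-channel affine normalizer (an assumption, not the paper's code)."""

    def __init__(self, dim, eps=1e-6):
        # Both parameters would be learnable in an actual training setup.
        self.shift = np.zeros(dim)
        self.log_scale = np.zeros(dim)  # scale = exp(log_scale) stays positive
        self.eps = eps

    def init_from_data(self, emb):
        # Initialize from per-channel statistics; training could refine these.
        self.shift = emb.mean(axis=0)
        self.log_scale = np.log(emb.std(axis=0) + self.eps)

    def normalize(self, emb):
        # Map raw embeddings to a standardized space that is easier to regress.
        return (emb - self.shift) / np.exp(self.log_scale)

    def denormalize(self, z):
        # Map standardized values back to the original embedding space.
        return z * np.exp(self.log_scale) + self.shift


rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings whose channels have very different ranges.
emb = rng.normal(
    loc=[0.0, 50.0, -3.0, 0.2],
    scale=[0.01, 20.0, 1.0, 5.0],
    size=(1000, 4),
)

norm = LearnableNorm(dim=4)
norm.init_from_data(emb)
targets = norm.normalize(emb)          # what the 3D representation would learn
recovered = norm.denormalize(targets)  # what would be handed back downstream
```

Under this assumption, the standardized targets would be what the 3D Gaussians are trained to reproduce, and `denormalize` would restore rendered features to the original embedding space before they are passed on.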
