The emergence of neural representations has revolutionized the way we digitally view a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel viewpoints. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantic understanding embodied within the scene. These methods elevate the rich semantic understanding from 2D imagery to 3D representations, distilling high-dimensional spatial features into 3D space. In our work, we are interested in connecting language with a dynamic model of the world. We show how to lift spatio-temporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in a video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
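As a rough illustration of the kind of query such a representation enables, the sketch below scores per-Gaussian, per-timestep language features against a text embedding and returns the best-matching (Gaussian, frame) pair. The feature field shape, its dimensionality, and the random stand-ins for learned embeddings are all hypothetical placeholders; in an actual system the features would be distilled from a vision-language encoder and the query would be a real text embedding.

```python
import numpy as np

# Hypothetical 4D feature field: N Gaussians, each carrying a
# time-conditioned language feature of dimension D at T timesteps.
# Random values stand in for features distilled from a
# vision-language model; they are illustrative only.
N, T, D = 1000, 30, 64
rng = np.random.default_rng(0)
feats = rng.standard_normal((N, T, D)).astype(np.float32)  # learned features (stand-in)
query = rng.standard_normal(D).astype(np.float32)          # text embedding (stand-in)

# L2-normalize both sides so the dot product is cosine similarity.
feats_n = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
query_n = query / np.linalg.norm(query)

# Relevancy of every Gaussian at every timestep to the prompt,
# then the argmax over both axes: a crude spatiotemporal
# localization of the queried event.
scores = np.einsum("ntd,d->nt", feats_n, query_n)
g_idx, t_idx = np.unravel_index(np.argmax(scores), scores.shape)
print(f"best match: Gaussian {g_idx} at frame {t_idx}")
```

In a full system, the per-timestep relevancy scores would be splatted back into image space to highlight the queried event across rendered frames, rather than reduced to a single argmax.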