The Bird's-eye View (BeV) representation is widely used for 3D perception from multi-view camera images. It allows to merge features from different cameras into a common space, providing a unified representation of the 3D scene. The key component is the view transformer, which transforms image views into the BeV. However, actual view transformer methods based on geometry or cross-attention do not provide a sufficiently detailed representation of the scene, as they use a sub-sampling of the 3D space that is non-optimal for modeling the fine structures of the environment. In this paper, we propose GaussianBeV, a novel method for transforming image features to BeV by finely representing the scene using a set of 3D gaussians located and oriented in 3D space. This representation is then splattered to produce the BeV feature map by adapting recent advances in 3D representation rendering based on gaussian splatting. GaussianBeV is the first approach to use this 3D gaussian modeling and 3D scene rendering process online, i.e. without optimizing it on a specific scene and directly integrated into a single stage model for BeV scene understanding. Experiments show that the proposed representation is highly effective and place GaussianBeV as the new state-of-the-art on the BeV semantic segmentation task on the nuScenes dataset.
鸟瞰图(BeV)表示法被广泛用于多视角相机图像的3D感知。它允许将不同相机的特征合并到一个公共空间中,为3D场景提供统一的表示。其关键组件是视图转换器,它将图像视图转换为BeV。然而,基于几何或交叉注意力的现有视图转换器方法无法提供足够详细的场景表示,因为它们使用的3D空间子采样对于建模环境的精细结构来说并不理想。在本文中,我们提出了GaussianBeV,这是一种新颖的方法,通过使用一组位于3D空间中并具有特定朝向的3D高斯分布来精细地表示场景,从而将图像特征转换为BeV。然后,通过改编最近基于高斯喷溅的3D表示渲染进展,将这种表示喷溅生成BeV特征图。GaussianBeV是第一个在线使用这种3D高斯建模和3D场景渲染过程的方法,即无需针对特定场景进行优化,而是直接集成到单阶段模型中用于BeV场景理解。实验表明,所提出的表示方法非常有效,使GaussianBeV成为nuScenes数据集上BeV语义分割任务的新的最先进技术。