Vision-based autonomous driving shows great potential owing to its satisfactory performance and low cost. Most existing methods adopt dense representations (e.g., bird's-eye view) or sparse representations (e.g., instance boxes) for decision-making, and therefore face a trade-off between comprehensiveness and efficiency. This paper explores a Gaussian-centric end-to-end autonomous driving (GaussianAD) framework that uses 3D semantic Gaussians to describe the scene comprehensively yet sparsely. We initialize the scene with uniform 3D Gaussians and progressively refine them using surround-view images to obtain the 3D Gaussian scene representation. We then apply sparse convolutions to efficiently perform 3D perception (e.g., 3D detection and semantic map construction). We predict 3D flows for the Gaussians with dynamic semantics and plan the ego trajectory accordingly, with future scene forecasting as the objective. GaussianAD can be trained end-to-end, with perception labels used optionally when they are available. Extensive experiments on the widely used nuScenes dataset verify the effectiveness of our end-to-end GaussianAD on various tasks, including motion planning, 3D occupancy prediction, and 4D occupancy forecasting.
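To make the first and fourth steps of the pipeline concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Gaussian is stored as a mean, a per-axis scale, a unit quaternion, and a vector of semantic logits; the function names `init_uniform_gaussians` and `apply_flow`, the dictionary layout, and the scene bounds in the usage note are all hypothetical choices made for illustration. The image-based refinement, sparse convolutions, and planner are not shown.

```python
import numpy as np


def init_uniform_gaussians(lo, hi, n_per_axis, num_classes):
    """Place one Gaussian at the center of each cell of a regular 3D grid.

    Hypothetical layout: the abstract only says the scene is initialized
    with uniform 3D Gaussians, so the fields below are assumptions.
    """
    lo = np.asarray(lo, dtype=np.float64)
    hi = np.asarray(hi, dtype=np.float64)
    n = np.asarray(n_per_axis, dtype=np.int64)
    cell = (hi - lo) / n  # edge lengths of one grid cell

    # Cell centers along each axis, then the full Cartesian grid.
    axes = [lo[i] + (np.arange(n[i]) + 0.5) * cell[i] for i in range(3)]
    means = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    count = means.shape[0]

    return {
        "mean": means,                                           # (N, 3)
        "scale": np.tile(cell / 2.0, (count, 1)),                # (N, 3)
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (count, 1)),   # identity quaternion
        "semantics": np.zeros((count, num_classes)),             # uninformative logits
    }


def apply_flow(gaussians, flow, dynamic_mask):
    """Shift the means of Gaussians flagged as dynamic by their 3D flow.

    `flow` is (N, 3); `dynamic_mask` is a boolean (N,) array. In the paper
    both would come from learned predictions; here they are plain inputs.
    """
    moved = dict(gaussians)
    moved["mean"] = gaussians["mean"] + flow * dynamic_mask[:, None]
    return moved
```

A call such as `init_uniform_gaussians((-40, -40, -1), (40, 40, 5), (4, 4, 2), num_classes=3)` yields 32 Gaussians; the bounds here are arbitrary illustrative values, not settings from the paper. Masking the flow keeps static Gaussians in place, which mirrors the idea of predicting motion only for Gaussians with dynamic semantics.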