Language-Embedded Gaussian Splats (LEGS): Incrementally Building Room-Scale Representations with a Mobile Robot
Building semantic 3D maps is valuable for searching for objects of interest in offices, warehouses, stores, and homes. We present a mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS): a detailed 3D scene representation that jointly encodes appearance and semantics. LEGS is trained online as a robot traverses its environment, enabling localization of open-vocabulary object queries. We evaluate LEGS on 4 room-scale scenes, querying for objects in each scene to assess how well LEGS captures semantic meaning. We compare LEGS to LERF and find that while the two systems have comparable object query success rates, LEGS trains over 3.5x faster than LERF. Results suggest that a multi-camera setup and incremental bundle adjustment can boost visual reconstruction quality in constrained robot trajectories, and that LEGS can localize open-vocabulary and long-tail object queries with up to 66% accuracy.
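To make the open-vocabulary query step concrete, below is a minimal sketch of how a text query might be matched against language features stored in a splat. It assumes each Gaussian carries a CLIP-space language embedding and that the query has already been embedded with the same text encoder; the function name `localize_query`, the feature dimensionality, and the random stand-in data are illustrative, not the paper's implementation (which, like LERF, may use a more involved relevancy score with canonical negative phrases rather than a raw cosine similarity).

```python
import numpy as np

def localize_query(query_embedding, gaussian_means, gaussian_lang_feats):
    """Score each Gaussian against a text query by cosine similarity
    and return the 3D center of the best-scoring Gaussian.

    Hypothetical sketch: assumes per-Gaussian language features live in
    the same embedding space as the text query (e.g., CLIP)."""
    # Normalize so dot products equal cosine similarities.
    q = query_embedding / np.linalg.norm(query_embedding)
    f = gaussian_lang_feats / np.linalg.norm(
        gaussian_lang_feats, axis=1, keepdims=True
    )
    scores = f @ q                      # (N,) similarity per Gaussian
    best = int(np.argmax(scores))
    return gaussian_means[best], scores[best]

# Toy usage with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
means = rng.uniform(-5, 5, size=(10_000, 3))   # Gaussian centers (meters)
feats = rng.normal(size=(10_000, 512))         # per-Gaussian language features
query = rng.normal(size=512)                   # embedding of e.g. "fire extinguisher"

xyz, score = localize_query(query, means, feats)
print(f"best match at {xyz} (similarity {score:.3f})")
```

In practice one would return a relevancy heatmap over all Gaussians rather than a single argmax, but the core operation, a normalized dot product between the query embedding and the embedded language field, is the same.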