Vision-and-Language Navigation (VLN), in which an agent follows instructions to reach a target destination, has recently seen significant advances. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) is more challenging: the agent is free to navigate to any unobstructed location and is more vulnerable to visual occlusions and blind spots. Recent approaches address this by imagining future environments, through either predicted future visual images or predicted semantic features, rather than relying solely on current observations. However, RGB-based methods lack the high-level semantics, and feature-based methods lack the intuitive appearance-level information, that effective navigation requires. To overcome these limitations, we introduce UnitedVLN, a novel, generalizable pre-training paradigm based on 3D Gaussian Splatting (3DGS) that enables agents to better explore future environments by jointly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes, search-then-query sampling and separate-then-united rendering, which make efficient use of neural primitives and integrate appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.