This repository is the official implementation of *Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM* (IEEE Robotics and Automation Letters, 2024).
1. Use Anaconda to create a Python 3.8 environment:
conda create -n vln python=3.8
conda activate vln
2. Install CLIP:
pip install git+https://github.com/openai/CLIP.git
3. Install the remaining Python requirements:
pip install -r python_requirements.txt
4. Install the Matterport3D simulator (v0.1):
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev
mkdir build && cd build
cmake -DEGL_RENDERING=ON ..
make -j8
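After the build finishes, you can sanity-check that the simulator's Python bindings are importable. The snippet below is a minimal sketch that assumes the compiled MatterSim module ends up in the build/ directory; adjust the path to wherever your build actually places it.

```python
# Minimal import check for the Matterport3D simulator bindings.
# Assumes the compiled MatterSim module lives in build/ (adjust if your layout differs).
import sys
sys.path.append('build')

import MatterSim

sim = MatterSim.Simulator()  # constructing a Simulator confirms the bindings load
print('MatterSim bindings loaded:', sim)
```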
6. Download the CLIP model (optional)
Download the CLIP model here and place it under img_features/, or run the script to download the model automatically.
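If you prefer to fetch the weights programmatically, the sketch below uses the standard OpenAI CLIP API to download a model into img_features/ and encode a single image. The ViT-B/32 backbone and the image path are placeholder assumptions; check the repository's feature-extraction script for the exact model it expects.

```python
# Hedged sketch: download a CLIP model into img_features/ and encode one image.
# "ViT-B/32" and "example.jpg" are placeholders, not necessarily what this repo uses.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, download_root="img_features/")

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)  # shape (1, 512) for ViT-B/32
print(image_features.shape)
```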
7. Download the ChatGLM-6B model (optional)
Download the ChatGLM-6B model here for online instruction decomposition. This step is not strictly necessary, because the decomposed instructions have already been pre-processed and stored in the JSON files under tasks/R2R/data/.
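For reference, online decomposition with ChatGLM-6B can be run through the Hugging Face transformers interface as sketched below. The prompt wording is purely illustrative and not the exact prompt used in the paper; the training scripts consume the pre-processed sub-instructions under tasks/R2R/data/.

```python
# Hedged sketch: decompose an R2R instruction into sub-instructions with ChatGLM-6B.
# The prompt is illustrative only; the repo ships pre-processed results in tasks/R2R/data/.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

instruction = ("Walk past the kitchen counter, turn left into the hallway, "
               "and stop next to the sofa in the living room.")
prompt = ("Split the following navigation instruction into an ordered list of "
          "short sub-instructions, one action per line:\n" + instruction)

response, history = model.chat(tokenizer, prompt, history=[])
print(response)
```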
Train the agent by
bash run/agent.bash 0
0 is the GPU id. It will train the agent and save snapshots under snap/agent/.
After training the agent, test it by
bash run/test_agent.bash 0
0 is the GPU id. It will load the trained agent and evaluate it on the test set.
We opt for a simple reward function for two main reasons. First, this reward design sufficiently supports the agent in learning effective policies and facilitates fair comparisons with existing methods. Second, we aim to minimize task-specific customizations to maintain the model's generalizability; overly complex or inaccurate rewards could diminish the model's performance.
| Methods | Validation Seen: NL↓ | NE↓ | SR↑ | SPL↑ | Validation Unseen: NL↓ | NE↓ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|---|
| DILLM-VLN | 12.8 | 4.74 | 57.2 | 0.51 | 11.4 | 5.31 | 49.4 | 0.44 |
| + SGS | 12.3 | 5.15 | 53.5 | 0.48 | 12.8 | 5.37 | 47.5 | 0.41 |
| + OGS | 11.8 | 5.27 | 52.2 | 0.47 | 11.7 | 5.66 | 46.1 | 0.40 |
We have incorporated the scene grounding score (SGS, which assesses whether the agent has reached the scene described by the sub-instruction) and the object grounding score (OGS, which determines whether the agent has found the target object described in the sub-instruction) into the reward function. The table above presents the experimental results, showing a decline in navigation performance when SGS and OGS are added. This indicates that the design of the reward function directly influences the learning objectives of the agent. Our task design already decomposes the navigation task into multiple simple sub-instructions, focusing the agent on completing each sub-instruction sequentially. The additional reward signals introduce unnecessary distractions, hindering the agent's learning of efficient navigation policies.
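To make the distinction concrete, the sketch below shows the general shape of a simple distance-based VLN reward and how grounding scores such as SGS and OGS would enter as extra shaping terms. It is an illustration of the trade-off discussed above, not the exact reward implemented in this repository; the 3.0 m success threshold and the shaping weights are assumed values.

```python
# Illustrative sketch of a simple distance-based VLN reward, plus optional
# grounding-score shaping. NOT the exact reward used in this repository;
# the success threshold and shaping weights below are assumed values.

def simple_reward(prev_dist: float, curr_dist: float, done: bool,
                  success_threshold: float = 3.0) -> float:
    """Progress toward the goal, plus a terminal success/failure bonus."""
    reward = prev_dist - curr_dist              # positive if the agent moved closer
    if done:
        reward += 2.0 if curr_dist < success_threshold else -2.0
    return reward

def shaped_reward(prev_dist: float, curr_dist: float, done: bool,
                  sgs: float = 0.0, ogs: float = 0.0,
                  w_sgs: float = 0.5, w_ogs: float = 0.5) -> float:
    """Base reward plus scene/object grounding scores (the "+ SGS" / "+ OGS" rows)."""
    return simple_reward(prev_dist, curr_dist, done) + w_sgs * sgs + w_ogs * ogs
```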
If you find this work helpful, please consider citing:
@article{wang2024boosting,
  title={Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM},
  author={Wang, Jiawei and Wang, Teng and Cai, Wenzhe and Xu, Lele and Sun, Changyin},
  journal={IEEE Robotics and Automation Letters},
  year={2024},
  doi={10.1109/LRA.2024.3511402}
}