Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step parsing

This is the official code implementation of the COLING 2025 paper Read Before Grounding: Scene Knowledge Visual Grounding via Multi-step parsing.

Data Preparation

After the data download is complete, unzip the file into the current folder.
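If you prefer to do the extraction from Python, a minimal sketch follows. It assumes the downloaded data is a standard zip archive; the helper name `unzip_to_current_folder` and the file name `data.zip` are placeholders for illustration and are not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_to_current_folder(archive_path, dest="."):
    """Extract every member of a zip archive into dest and return the member names."""
    archive = Path(archive_path)
    if not archive.is_file():
        raise FileNotFoundError(f"Archive not found: {archive}")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)  # creates dest and any subfolders as needed
        return zf.namelist()


# Example call ("data.zip" is a placeholder, use the name of the file you downloaded):
# unzip_to_current_folder("data.zip")
```

The returned list of member names is a quick way to confirm which folder the scripts' data paths should point to.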

First, generate the visual descriptors:

python qwen_api.py # you may need to adjust the data path

Then you can use these visual descriptors to evaluate multimodal models:

python qwen_api_baseline.py # you may need to adjust the data path

python glm4_flash.py # you may need to adjust the data path
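The order of the commands above matters, since the evaluation scripts rely on the descriptors produced by the first step. A minimal sketch of running them in sequence is shown below. It assumes each script runs without command-line arguments once its data path has been adjusted, and that it is launched from the repository root; `pipeline_commands` and `run_pipeline` are hypothetical helper names, not part of this repository.

```python
import subprocess
import sys

# Script names are taken from the steps above; descriptor generation comes first.
PIPELINE = ["qwen_api.py", "qwen_api_baseline.py", "glm4_flash.py"]


def pipeline_commands(python=sys.executable):
    """Build one command per script, in the order the steps above describe."""
    return [[python, script] for script in PIPELINE]


def run_pipeline():
    """Run each script in turn and stop at the first one that fails."""
    for cmd in pipeline_commands():
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


# Call run_pipeline() from the repository root after adjusting the data paths.
```

Using `check=True` stops the sequence if descriptor generation fails, so the evaluation scripts are not run against missing descriptors.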
