Using the default prompt in https://github.com/open-compass/VLMEvalKit/tree/main, I got 22.26 for Qwen-VL-Chat. Using a different prompt or evaluation post-processing method can lead to large variance. Similarly, for Deepseek_vl_7b I got 26.86 with the provided LLaVA-NeXT prompt and 32.6 with the default prompt in VLMEvalKit.
Is there an evaluation pipeline for the other models reported in the paper? I found it hard to replicate the exact numbers without the exact prompts.
For Qwen, I followed the setup in https://github.com/open-compass/VLMEvalKit/tree/main.
Qwen-VL-Chat directly outputs the answer text instead of a letter choice. Did you use a customized prompt or do any post-processing of the model responses? A sketch of the post-processing I'm using is below.
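For context, this is roughly the fallback matching I apply when the model answers with the option text rather than a letter. It's only a minimal sketch I wrote myself (the `extract_choice` function and the dict-based option format are my own assumptions, not from VLMEvalKit or the paper), and the scored accuracy can shift noticeably depending on how ambiguous or non-matching responses are counted:

```python
import re


def extract_choice(response: str, options: dict[str, str]) -> str | None:
    """Map a free-form model response to an option letter, or None if ambiguous.

    `options` is assumed to look like {"A": "cat", "B": "dog", ...};
    this is my own convention, not a VLMEvalKit API.
    """
    text = response.strip()

    # Case 1: the response already starts with a bare letter, e.g. "B", "B.", "(B)".
    m = re.match(r"^\(?([A-D])\)?[\s.:,)]", text + " ")
    if m:
        return m.group(1)

    # Case 2: the response contains exactly one option's text verbatim.
    matches = [letter for letter, opt in options.items()
               if opt.lower() in text.lower()]
    if len(matches) == 1:
        return matches[0]

    # Ambiguous or no match: leave unresolved (I count these as incorrect).
    return None


# Example: Qwen-VL-Chat answering with the option text instead of a letter.
print(extract_choice("The animal in the image is a dog.",
                     {"A": "cat", "B": "dog", "C": "bird", "D": "fish"}))  # -> "B"
```

Whether step 2 counts partial or multiple matches as correct is exactly the kind of choice that moves the score by several points, which is why I'd like to know the post-processing used for the reported numbers.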