We fine-tune a unified visual instruction synthesizer that generates diverse tasks based on image-caption pairs across various domains.
The following steps reproduce our visual instruction synthesizer. Alternatively, you can skip these steps and download our synthesizer from AdaptLLM/visual-instruction-synthesizer.
Fine-tuning steps:
Coming soon...
We use the synthesizer to generate task triplets from image-caption pairs in the target domain, followed by consistency-based data filtering to enhance data quality.
The following steps reproduce our data. You can also skip them and download the resulting synthetic data (including image_caption_and_synthetic_task.json and images) from:
conda activate vllm
cd QA-Synthesizer/vllm_inference
SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer # Path to the synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B # Language model for consistency checks
We have included a few data samples in this repository for a quick try:
IMAGE_CAPTION='../data_samples/image_caption_pairs.json' # Path to the image-caption pairs
IMAGE_FOLDER='../data_samples/images' # Path to the image folder
OUTPUT_DIR='../data_samples/' # Output directory for synthesized data
# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
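The command above assumes eight GPUs. As a minimal sketch for machines with fewer devices, the device list can be built from a GPU count. The helper `gpu_list` and the variables `NUM_GPUS` and `DEVICES` are hypothetical names introduced here for illustration; they are not part of this repository.

```shell
# Hypothetical helper (not part of the repo): print "0,1,...,N-1" for N GPUs.
gpu_list() {
  n="$1"
  list=""
  i=0
  while [ "$i" -lt "$n" ]; do
    if [ -z "$list" ]; then
      list="$i"
    else
      list="${list},${i}"
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

# Assumed GPU count; set this to the number of devices on your machine.
NUM_GPUS=4
DEVICES="$(gpu_list "${NUM_GPUS}")"
echo "CUDA_VISIBLE_DEVICES=${DEVICES}"   # prints CUDA_VISIBLE_DEVICES=0,1,2,3
```

The resulting value can then replace the hard-coded list, for example `CUDA_VISIBLE_DEVICES="${DEVICES}" bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}`.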
- Download the image_caption_pairs.json file and images from AdaptLLM/biomed-visual-instructions.
- Then run:
IMAGE_CAPTION="./biomed-visual-instructions/image_caption_pairs.json"
IMAGE_FOLDER="./biomed-visual-instructions/images"
OUTPUT_DIR="./biomed-visual-instructions"
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
- Download the image_caption_pairs.json file and images from AdaptLLM/food-visual-instructions.
- Then run:
IMAGE_CAPTION="./food-visual-instructions/image_caption_pairs.json"
IMAGE_FOLDER="./food-visual-instructions/images"
OUTPUT_DIR="./food-visual-instructions"
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
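The download step in the two lists above can be sketched with the Hugging Face CLI. This assumes that the two repositories are hosted as Hugging Face dataset repos and that `huggingface-cli` (from the `huggingface_hub` package) is installed. The helpers `domain_repo` and `download_domain` are hypothetical names introduced here for illustration. Depending on how the files are packaged, the images may need unpacking after download.

```shell
# Hypothetical helper: map a domain name to its repo id, following the
# naming pattern of the two repos mentioned above.
domain_repo() {
  printf 'AdaptLLM/%s-visual-instructions\n' "$1"
}

# Hypothetical helper: download one domain's files into
# ./<domain>-visual-instructions, matching the paths used above.
# Assumption: the repo is a Hugging Face *dataset* repo.
download_domain() {
  domain="$1"
  huggingface-cli download "$(domain_repo "$domain")" \
    --repo-type dataset \
    --local-dir "./${domain}-visual-instructions"
}
```

For example, `download_domain biomed` or `download_domain food`, run from the directory where you later invoke run_synthesis.sh so that the relative paths line up.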
The synthesized output for single-stage post-training will be saved at: ${OUTPUT_DIR}/image_caption_and_synthetic_task.json
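As a quick sanity check after a run, the sketch below tests whether the output file exists and is non-empty. The helper `check_output` is a hypothetical name introduced here; it does not inspect the JSON contents, whose schema is not described in this section.

```shell
# Hypothetical helper: report whether the synthesized task file exists
# and is non-empty in the given output directory.
check_output() {
  out_file="$1/image_caption_and_synthetic_task.json"
  if [ -s "$out_file" ]; then
    echo "found: $out_file"
  else
    echo "missing or empty: $out_file"
    return 1
  fi
}

# Falls back to the sample directory used above if OUTPUT_DIR is unset.
check_output "${OUTPUT_DIR:-../data_samples}" || echo "Run run_synthesis.sh first."
```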