Below we provide instructions for training and inference on audio and vision-language tasks. Pretrained and finetuned checkpoints are provided in checkpoints.md.
We recommend organizing your workspace directory as follows:
ONE-PEACE/
├── assets/
├── fairseq/
├── one_peace/
│ ├── checkpoints
│ │ ├── one-peace.pt
│ ├── criterions
│ ├── data
│ ├── dataset
│ │ ├── esc50/
│ │ ├── flickr30k/
│ ├── metrics
│ └── ...
├── .gitignore
├── LICENSE
├── README.md
├── checkpoints.md
├── datasets.md
├── requirements.txt
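Before launching any script, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of the repository: `check_workspace` is a hypothetical helper, and the list of paths it checks is an assumption drawn from the tree above, so trim or extend it to match the tasks you actually run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which expected workspace entries are missing.
# The entries mirror the directory tree above and are an assumption,
# not a list the repository itself enforces.
check_workspace() {
    local root="${1:-.}"
    local missing=0
    local entry
    for entry in \
        fairseq \
        one_peace/checkpoints/one-peace.pt \
        one_peace/dataset \
        requirements.txt
    do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}

# Usage, from the parent directory of ONE-PEACE/:
#   check_workspace ONE-PEACE && echo "workspace looks complete"
```

The function prints one line per missing entry, so its output can be read directly or used in a conditional.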
Please note that if your device does not support bf16 precision, you can switch to fp16 precision for finetuning or inference by changing the `common` section of the config:
common:
  # use bf16
  # fp16: false
  # memory_efficient_fp16: false
  # bf16: true
  # memory_efficient_bf16: true
  # use fp16
  fp16: true
  memory_efficient_fp16: true
  bf16: false
  memory_efficient_bf16: false
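If you are unsure which precision your device supports, the sketch below shows one way to decide and to print the matching `common:` block. Both function names are hypothetical. `detect_precision` assumes that a `python` executable with PyTorch is on your `PATH` and relies on `torch.cuda.is_bf16_supported()`; it falls back to fp16 whenever that check cannot run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the `common:` precision block for "bf16" or "fp16".
precision_block() {
    local mode="$1"
    local use_bf16 use_fp16
    case "$mode" in
        bf16) use_bf16=true;  use_fp16=false ;;
        fp16) use_bf16=false; use_fp16=true ;;
        *) echo "unknown precision: $mode" >&2; return 1 ;;
    esac
    printf 'common:\n'
    printf '  fp16: %s\n' "$use_fp16"
    printf '  memory_efficient_fp16: %s\n' "$use_fp16"
    printf '  bf16: %s\n' "$use_bf16"
    printf '  memory_efficient_bf16: %s\n' "$use_bf16"
}

# Hypothetical helper: ask PyTorch whether the current GPU supports bf16.
# Any failure (no GPU, no torch, older PyTorch) falls back to fp16.
detect_precision() {
    if python -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 1)' 2>/dev/null; then
        echo bf16
    else
        echo fp16
    fi
}

# Usage:
#   precision_block "$(detect_precision)"
```

The printed block is meant to be copied into the config by hand; the sketch deliberately does not edit any file.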
The overall pretraining process of ONE-PEACE is divided into two stages: vision-language pretraining and audio-language pretraining.
Here we provide an example of vision-language pretraining.
- Download COCO. You can also replace COCO with your own datasets.
- Pretraining
cd one_peace/run_scripts/pretrain
bash pretrain_vl_3B.sh
At the audio-language pretraining stage, we initialize the model with the checkpoint from vision-language pretraining and train it on audio-text pairs.
- Download AudioCaps, Clotho and MACS. You can also prepare your own datasets.
- Pretraining. Remember to load the checkpoint produced by vision-language pretraining.
cd one_peace/run_scripts/pretrain
bash pretrain_al_3B.sh
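Because the second stage depends on the checkpoint from the first, the two stages have to run in order. The sketch below wraps the `cd` and `bash` pattern in a hypothetical `run_script` helper so that the second stage starts only if the first one exits successfully. It does not set the checkpoint path for you: pointing `pretrain_al_3B.sh` at the stage-one checkpoint remains a manual step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a script from inside its own directory.
run_script() {
    local dir="$1" script="$2"
    if [ ! -f "$dir/$script" ]; then
        echo "not found: $dir/$script" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && bash "$script" )
}

# Usage, from the repository root:
#   run_script one_peace/run_scripts/pretrain pretrain_vl_3B.sh \
#     && run_script one_peace/run_scripts/pretrain pretrain_al_3B.sh
```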
- Download ESC-50
- Inference
cd one_peace/run_scripts/esc50
bash zero_shot_evaluate.sh
- Finetuning
cd one_peace/run_scripts/image_text_retrieval
bash finetune_coco.sh
bash finetune_flickr.sh
- Inference
cd one_peace/run_scripts/image_text_retrieval
bash zero_shot_evaluate_coco.sh # zero-shot retrieval for COCO
bash zero_shot_evaluate_flickr.sh # zero-shot retrieval for Flickr30K
bash evaluate_coco.sh # evaluation for COCO
bash evaluate_flickr.sh # evaluation for Flickr30K
- Download NLVR2
- Finetuning
cd one_peace/run_scripts/nlvr2
bash finetune.sh
- Inference
cd one_peace/run_scripts/nlvr2
bash evaluate.sh
- Finetuning
cd one_peace/run_scripts/visual_grounding
bash finetune_refcoco.sh
bash finetune_refcoco+.sh
bash finetune_refcocog.sh
- Inference
cd one_peace/run_scripts/visual_grounding
bash evaluate_refcoco.sh # evaluation for RefCOCO
bash evaluate_refcoco+.sh # evaluation for RefCOCO+
bash evaluate_refcocog.sh # evaluation for RefCOCOg
- Download VQAv2
- Finetuning
cd one_peace/run_scripts/vqa
bash finetune.sh
- Inference
cd one_peace/run_scripts/vqa
bash evaluate.sh
- Finetuning
cd one_peace/run_scripts/audio_text_retrieval
bash finetune.sh
- Inference
cd one_peace/run_scripts/audio_text_retrieval
bash evaluate.sh
- Download AVQA
- Finetuning
cd one_peace/run_scripts/aqa
bash finetune.sh
- Inference
cd one_peace/run_scripts/aqa
bash evaluate.sh
- Download FSD50K
- Finetuning
cd one_peace/run_scripts/fsd50k
bash finetune.sh
- Inference
cd one_peace/run_scripts/fsd50k
bash evaluate.sh
- Download VGGSound
- Finetuning
cd one_peace/run_scripts/vggsound
bash finetune.sh
- Inference
cd one_peace/run_scripts/vggsound
bash evaluate.sh
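Several tasks above (NLVR2, VQA, audio-text retrieval, AQA, FSD50K and VGGSound) share the same pair of scripts, `finetune.sh` and `evaluate.sh`. As a convenience, the sketch below chains the two steps for one task directory. `finetune_then_evaluate` is a hypothetical helper, and it assumes that `evaluate.sh` has already been edited to point at the checkpoint that `finetune.sh` writes; if that is not the case, run the two steps by hand instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: finetune, then evaluate only if finetuning succeeded.
# Assumes evaluate.sh already references the checkpoint finetune.sh produces.
finetune_then_evaluate() {
    local task_dir="$1"
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$task_dir" && bash finetune.sh && bash evaluate.sh )
}

# Usage, from the repository root:
#   for task in nlvr2 vqa audio_text_retrieval aqa fsd50k vggsound; do
#       finetune_then_evaluate "one_peace/run_scripts/$task" || break
#   done
```

Stopping at the first failure avoids evaluating a task whose finetuning run did not finish.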