Please feel free to contact [email protected] if you have any questions.
- In the paper, we name this method DRA (Diversification along the Intersection Region of Adversarial Trajectory) to facilitate comparison.
- We further improve DRA with the SA-AET method, achieving a significant enhancement in adversarial transferability. Please refer to the paper [2411.02669v1] Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack and the SA-AET code.
Why are the SGA results in the main text much higher than those in the SGA paper?
Answer: We adopted a different perturbation budget; the results in the main text use 8/255. If you need a comparison under the SGA perturbation setting of 2/255, please refer to Appendix Tables A5 and A6.
Recent research on vision-language pre-training (VLP) models highlights their vulnerability to multimodal adversarial examples (AEs). SGA suggests that increasing AE diversity during optimization improves transferability but risks overfitting the victim model. We propose enhancing AE transferability by diversifying examples along the intersection region of the adversarial trajectory. By incorporating text-guided AE selection and directing the adversarial text to deviate from the last intersection region, we aim to mitigate overfitting and improve the transferability of multimodal adversarial examples.
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
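(Optional) A quick sanity check, assuming a CUDA build, that PyTorch can see the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"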
Download the datasets Flickr30k and MSCOCO (the annotations are provided in ./data_annotation/). Set the root path of the dataset via the image_root field in ./configs/Retrieval_flickr.yaml.
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.
Prepare datasets:
You can download the datasets from this link, or use the following command:
wget --no-check-certificate 'https://drive.usercontent.google.com/download?id=1zBQUcAccfWeXa7MKBSEyvpT-IWy7Z0_A&export=download&authuser=0&confirm=t&uuid=39b175db-9722-40ba-a2bb-3be50baea4e0&at=AO7h07dBMmYAaBoPQvdhLYW0reD2:1726986807692' -O datasets.tar.gz
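After downloading, extract the archive (the layout inside is assumed to match the paths in the configs; adjust image_root if yours differs):
tar -xzvf datasets.tar.gz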
Prepare checkpoints for models:
Please create a checkpoints directory first, then use the following commands:
- ALBEF Pre-Trained on Flickr30K
wget https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/flickr30k.pth -O albef_flickr.pth
- ALBEF Pre-Trained on MSCOCO
wget https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/mscoco.pth -O albef_mscoco.pth
- TCL Pre-Trained on Flickr30K (link invalid now; see the Hugging Face mirror below)
wget --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qwWfqyCu1F5YZqQNxjkqy1REESoU6pOT&export=download&authuser=0&confirm=t&uuid=ba76663a-9838-4508-a417-cc2ac52b6afe&at=APZUnTXrehkOzsP4JEi2iyyy03qw%3A1719988215602' -O tcl_flickr.pth
- TCL Pre-Trained on MSCOCO (link invalid now; see the Hugging Face mirror below)
wget --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PtcZF_XzJgIceg4rXLWqGQiXjizvxxS6&export=download&authuser=0&confirm=t&uuid=c4664ba2-285a-47dc-9d9d-38fee6558997&at=APZUnTWkVgMXZCmEKxfQ-jE8PvV2%3A1719988395511' -O tcl_mscoco.pth
Common issue: we found that the pretrained weights were removed from the TCL repository, so we uploaded the weights we had saved to Hugging Face. Please download all the weights from the Hugging Face repository Sensen02/VLPTransferAttackCheckpoints.
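Alternatively, the mirrored weights can be fetched with the Hugging Face CLI (assuming huggingface_hub is installed; only the repository id is taken from above):
huggingface-cli download Sensen02/VLPTransferAttackCheckpoints --local-dir ./checkpoints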
Our method has two adjustable hyperparameters, both set in RAttacker.py: the sample_numbers attribute of the ImageAttacker class (default 5) and the text_ratios attribute of the TextAttacker class (default [0.6, 0.2, 0.2]).
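A minimal usage sketch (the constructor arguments are elided because the actual signatures live in RAttacker.py; only the attribute names and defaults above come from the repo):
from RAttacker import ImageAttacker, TextAttacker

img_attacker = ImageAttacker(...)            # constructor args elided; see RAttacker.py
img_attacker.sample_numbers = 5              # samples drawn along the adversarial trajectory (default 5)

txt_attacker = TextAttacker(...)             # constructor args elided; see RAttacker.py
txt_attacker.text_ratios = [0.6, 0.2, 0.2]   # ratios for text-guided AE selection (default)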
We provide eval.py for the image-text retrieval attack evaluation (you can choose to import SGAttacker or RAttacker (ours) in eval.py). The running parameters are annotated below:
--config: path to the config file
--cuda_id: id of the GPU to use
--model_list: all VLP models to evaluate; we provide ALBEF, TCL, and CLIP (ViT & CNN)
--source_model: the VLP model used to generate the multimodal adversarial examples
--albef_ckpt: the checkpoint for ALBEF
--tcl_ckpt: the checkpoint for TCL
Here is an example for the Flickr30K dataset:
python eval.py --config ./configs/Retrieval_flickr.yaml \
--cuda_id 0 \
--model_list ['ALBEF','TCL','CLIP_ViT','CLIP_CNN'] \
--source_model CLIP_CNN \
--albef_ckpt ./checkpoints/albef_flickr.pth \
--tcl_ckpt ./checkpoints/tcl_flickr.pth \
--original_rank_index_path ./std_eval_idx/flickr30k/
Main Results
We present two cross-task attack evaluations: ITR->VG (image-text retrieval to visual grounding) and ITR->IC (image-text retrieval to image captioning).
ITR->VG:
First, please use the MSCOCO dataset and the provided files ./data_annotation/refcoco+_test_for_adv.json and ./data_annotation/refcoco+_val_for_adv.json to generate adversarial images (10K images).
After that, please refer to Grounding.py (run with '--evaluate') in ALBEF, and replace the clean images in the MSCOCO dataset with the adversarial images; a sketch of the replacement step is given below. You can then obtain the performance of the ALBEF model on the adversarial images, corresponding to the Val, TestA, and TestB metrics.
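The replacement step can be scripted; here is a hedged sketch (both directory names are assumptions, so adjust them to where your attack saved its outputs and to the MSCOCO image root in your config):
import shutil
from pathlib import Path

adv_dir = Path("./output/adv_images")      # assumed: output directory of the attack
coco_dir = Path("./data/mscoco/images")    # assumed: MSCOCO image root used by Grounding.py

# Keeping the original filenames lets the annotation files resolve to the adversarial images.
for adv_img in adv_dir.glob("*.jpg"):
    shutil.copy(adv_img, coco_dir / adv_img.name)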
ITR->IC:
First, please use the MSCOCO dataset and the provided files ./data_annotation/coco_karpathy_test.json and ./data_annotation/coco_karpathy_val.json to generate adversarial images (3K images).
After that, please refer to train_caption.py (run with '--evaluate') in BLIP, and replace the clean images in the MSCOCO dataset with the adversarial images. You can then obtain the performance of the BLIP model on the adversarial images, corresponding to the B@4, METEOR, ROUGE-L, CIDEr, and SPICE metrics.
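For reference, BLIP's captioning evaluation is typically launched as follows (the command follows the BLIP repository's README; adjust the GPU count, and make sure the config's image_root points at the replaced images):
python -m torch.distributed.run --nproc_per_node=1 train_caption.py --evaluate
The same image-replacement sketch shown for ITR->VG applies here as well.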
Evaluation on LLMs:
Please send the adversarial images to LLMs and prompt these systems with the query "Describe this image".
(Figures: example responses from GPT-4 and Claude-3.)
Kindly include a reference to these papers in your publications if they help your research:
# DRA
@inproceedings{gao2025boosting,
title={Boosting transferability in vision-language attacks via diversification along the intersection region of adversarial trajectory},
author={Gao, Sensen and Jia, Xiaojun and Ren, Xuhong and Tsang, Ivor and Guo, Qing},
booktitle={European Conference on Computer Vision},
pages={442--460},
year={2025},
organization={Springer}
}
# SA-AET
@article{jia2024semantic,
title={Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack},
author={Jia, Xiaojun and Gao, Sensen and Guo, Qing and Ma, Ke and Huang, Yihao and Qin, Simeng and Liu, Yang and Cao, Xiaochun},
journal={arXiv preprint arXiv:2411.02669},
year={2024}
}