[Project Page] | [Slides] | [arXiv] | [Data Repository]
In this research, we evaluate the adversarial robustness of recent large vision-language (generative) models (VLMs) under the most realistic and challenging setting: a black-box threat model with targeted goals.
Our proposed method aims to induce targeted responses from large VLMs such as MiniGPT-4, LLaVA, Unidiffuser, BLIP/BLIP-2, Img2Prompt, etc.
In other words, we mislead the VLMs into saying whatever you want, regardless of the content of the input image query.
- Platform: Linux
- Hardware: A100 PCIe 40G
- lmdb, tqdm
- wandb, torchvision, etc.
In our work, we used DALL-E, Midjourney, and Stable Diffusion for target image generation and demonstration. For the large-scale experiments, we apply Stable Diffusion for target image generation. To install Stable Diffusion, we initialize our conda environment following Latent Diffusion Models. A suitable base conda environment named ldm can be created and activated with:
conda env create -f environment.yaml
conda activate ldm
Note that for different victim models, we will follow their official implementations and conda environments.
As discussed in our paper, to achieve a flexible targeted attack, we leverage a pretrained text-to-image model to generate a targeted image given a single caption as the targeted text. Consequently, you can specify the targeted caption for the attack yourself!
We use Stable Diffusion, DALL-E or Midjourney as the text-to-image generators in our experiments. Here, we use Stable Diffusion for demonstration (thanks for open-sourcing!).
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
then, prepare the full targeted captions from MS-COCO, or download our processed and cleaned version:
https://drive.google.com/file/d/19tT036LBvqYonzI7PfU9qVi3jVGApKrg/view?usp=sharing
and move it to ./stable-diffusion/. In experiments, one can randomly sample a subset of COCO captions (e.g., 10, 100, 1K, 10K, 50K) for the adversarial attack. For example, let's assume we have randomly sampled 10K COCO captions as our targeted text c_tar and stored them in the following file:
https://drive.google.com/file/d/1e5W3Yim7ZJRw3_C64yqVZg_Na7dOawaF/view?usp=sharing
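For reference, below is a minimal sketch of how such a subset can be sampled, assuming the full captions are stored one per line in a plain-text file (the file names here are placeholders, not the exact files above):
import random

# read the full set of MS-COCO captions (one caption per line)
with open('coco_captions_full.txt', 'r') as f:
    captions = [line.strip() for line in f if line.strip()]

# randomly sample a subset (e.g., 10K) to serve as the targeted texts c_tar
random.seed(0)  # fix the seed for reproducibility
subset = random.sample(captions, 10000)

with open('coco_captions_10k.txt', 'w') as f:
    f.write('\n'.join(subset) + '\n')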
The targeted images h_ξ(c_tar) can be obtained via Stable Diffusion by reading the text prompts from the sampled COCO captions, using the script below and txt2img_coco.py (please move txt2img_coco.py to ./stable-diffusion/; note that hyperparameters can be adjusted to your preference):
python txt2img_coco.py \
--ddim_eta 0.0 \
--n_samples 10 \
--n_iter 1 \
--scale 7.5 \
--ddim_steps 50 \
--plms \
--skip_grid \
--ckpt ./_model_pool/sd-v1-4-full-ema.ckpt \
--from-file './name_of_your_coco_captions_file.txt' \
--outdir './path_of_your_targeted_images'
where the ckpt is provided by Stable Diffusion v1 and can be downloaded here: sd-v1-4-full-ema.ckpt.
Additional implementation details of text-to-image generation by Stable Diffusion can be found HERE.
There are two steps in the adversarial attack on VLMs: (1) a transfer-based attacking strategy and (2) a query-based attacking strategy that uses (1) as initialization. For the BLIP/BLIP-2/Img2Prompt models, please refer to ./LAVIS_tool. Here, we use Unidiffuser as an example.
- Installation
git clone https://github.com/thu-ml/unidiffuser.git
cd unidiffuser
cp ../unidff_tool/* ./
then, create a suitable conda environment named unidiffuser following the steps HERE, and prepare the corresponding model weights (we use uvit_v1.pth as the weight of U-ViT).
- Transfer-based attacking strategy
conda activate unidiffuser
bash _train_adv_img_trans.sh
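Conceptually, this transfer-based step perturbs the clean image so that its features on a surrogate image encoder match those of the targeted image h_ξ(c_tar) (the MF-ii objective). Below is a simplified PGD-style sketch of that idea, assuming a CLIP ViT-B/32 surrogate and CLIP-preprocessed image tensors; it is not the exact implementation inside _train_adv_img_trans.sh:
import torch
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# surrogate image encoder used only for crafting the transferable perturbation
surrogate, _ = clip.load('ViT-B/32', device=device)
surrogate = surrogate.float()  # keep fp32 weights for stable gradients

def craft_x_trans(x_clean, x_target, eps=8/255, alpha=1/255, steps=100):
    """Simplified sketch: match adversarial image features to targeted image features."""
    with torch.no_grad():
        tgt = surrogate.encode_image(x_target)
        tgt = tgt / tgt.norm(dim=-1, keepdim=True)
    delta = torch.zeros_like(x_clean, requires_grad=True)
    for _ in range(steps):
        adv = surrogate.encode_image(x_clean + delta)
        adv = adv / adv.norm(dim=-1, keepdim=True)
        sim = (adv * tgt).sum()                  # cosine similarity to maximize
        sim.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient-ascent step
            delta.clamp_(-eps, eps)              # project back into the L_inf budget
            delta.grad.zero_()
    return (x_clean + delta).detach()
Here x_clean and x_target are expected to be preprocessed image batches; clipping to the valid pixel range is omitted for brevity.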
the crafted adv images x_trans will be stored in dir of white-box transfer images, as specified in --output. Then, we perform image-to-text generation and store the generated responses of x_trans. This can be achieved by:
python _eval_i2t_dataset.py \
--batch_size 100 \
--mode i2t \
--img_path 'dir of white-box transfer images' \
--output 'dir of white-box transfer captions'
where the generated responses will be stored in dir of white-box transfer captions in .txt format. We will use them for pseudo-gradient estimation via the RGF estimator.
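For intuition, the RGF (random gradient-free) estimator approximates the gradient of a black-box objective, here the CLIP similarity between the victim model's generated response and c_tar, by finite differences along random directions. A minimal sketch, assuming a scalar-valued scoring function score_fn (a hypothetical name) that internally queries the victim VLM and computes the CLIP score:
import torch

def rgf_gradient(x_adv, score_fn, num_queries=8, sigma=8/255):
    """Estimate the gradient of score_fn at x_adv via random directions (RGF)."""
    base = score_fn(x_adv)                       # f(x): one query to the black-box pipeline
    grad_est = torch.zeros_like(x_adv)
    for _ in range(num_queries):
        u = torch.randn_like(x_adv)              # random search direction
        u = u / u.norm()                         # normalize to unit length
        grad_est += (score_fn(x_adv + sigma * u) - base) / sigma * u
    return grad_est / num_queries
The estimated gradient can then drive signed ascent steps on the query image, with the perturbation projected back into the chosen budget after each update.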
- Query-based attacking strategy (via the RGF estimator): assume we use a fixed perturbation budget for MF-ii + MF-tt (e.g., 8 px)
bash _train_trans_and_query_fixed_budget.sh
If instead you want to conduct the transfer+query-based attack with a separate perturbation budget, we additionally provide a script:
bash _train_trans_and_query_more_budget.sh
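Roughly, the difference between the two scripts is how the perturbation budget is allotted: with a fixed budget, the combined transfer+query perturbation is constrained within a single L_inf ball around the clean image, while with a separate budget the query stage gets its own ball around the transfer-stage image x_trans. A minimal sketch of the two projections (a simplified reading with hypothetical tensor names, not the exact logic of the scripts):
import torch

def project_fixed_budget(x_query, x_clean, eps_total=8/255):
    # fixed budget: the total perturbation stays within eps_total of the clean image
    delta = (x_query - x_clean).clamp(-eps_total, eps_total)
    return (x_clean + delta).clamp(0, 1)

def project_separate_budget(x_query, x_trans, eps_query=8/255):
    # separate budget: the query-stage perturbation alone stays within eps_query of x_trans
    delta = (x_query - x_trans).clamp(-eps_query, eps_query)
    return (x_trans + delta).clamp(0, 1)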
Here, we use wandb to dynamically monitor the moving average of the CLIP score (e.g., RN50, ViT-B/32, ViT-L/14, etc.), which evaluates the similarity between (a) the generated responses (of transfer/query images) and (b) the predefined targeted text c_tar.
An example is shown below, where the dotted line denotes the moving average of the CLIP score (of the image captions) after querying:
Meanwhile, the image captions after querying will be stored, and the directory can be specified by --output.
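For reference, a minimal sketch of how the CLIP score between a generated caption and the targeted text can be computed and logged to wandb; the model choice, project name, and sample data below are illustrative, not the exact evaluation code:
import torch
import clip
import wandb

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, _ = clip.load('ViT-B/32', device=device)

def clip_text_score(caption, targeted_text):
    """Cosine similarity between a generated caption and the targeted text c_tar."""
    tokens = clip.tokenize([caption, targeted_text]).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

# illustrative placeholders: replace with the stored captions and your targeted text
targeted_text = 'a sports car is driving on the road'
generated_captions = ['a bird sits on a branch', 'a red car on a highway']

wandb.init(project='attack-vlm-demo')  # hypothetical project name
moving_avg, beta = 0.0, 0.9
for step, caption in enumerate(generated_captions):
    score = clip_text_score(caption, targeted_text)
    moving_avg = beta * moving_avg + (1 - beta) * score
    wandb.log({'clip_score': score, 'clip_score_moving_avg': moving_avg}, step=step)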
If you find this project useful in your research, please consider citing our paper:
@inproceedings{zhao2023evaluate,
title={On Evaluating Adversarial Robustness of Large Vision-Language Models},
author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Li, Chongxuan and Cheung, Ngai-Man and Lin, Min},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}
Meanwhile, here is a related work that aims at embedding a watermark into (multi-modal) diffusion models:
@article{zhao2023recipe,
title={A Recipe for Watermarking Diffusion Models},
author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Cheung, Ngai-Man and Lin, Min},
journal={arXiv preprint arXiv:2303.10137},
year={2023}
}
We appreciate the wonderful base implementations of MiniGPT-4, LLaVA, Unidiffuser, LAVIS, and CLIP. We also thank @MetaAI for open-sourcing their LLaMA checkpoints. We thank SiSi for providing some enjoyable and visually pleasant images generated by @Midjourney in our research.