(ACL 2024) Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space
Paper: https://arxiv.org/abs/2402.16832
Webpage: https://claws-lab.github.io/projection-in-MLLMs/
GitHub: https://github.com/claws-lab/projection-in-MLLMs
Authors: Gaurav Verma¹, Minje Choi¹, Kartik Sharma¹, Jamelle Watson-Daniels², Sejoon Oh¹, and Srijan Kumar¹
Affiliations: ¹Georgia Institute of Technology, ²Harvard University
The codebase is built on top of LLaVA's codebase. Clone the repository from https://github.com/haotian-liu/LLaVA inside `./experiments/` and name the directory `llava-ft`. Then, follow the instructions provided in the original repository to set up the environment. Once the setup is complete, verify the installation by running the `llava-v1.5-7b` model with the following command inside the `./experiments/llava-ft` directory:
```bash
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit
```
Additionally, make sure that the `mm_projector.bin` file corresponding to the `llava-v1.5-7b` model is downloaded from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main. To use any other LLaVA-1.5 variant, explore the Model Zoo in the original repository.
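If you prefer to script the download, here is a minimal sketch using `huggingface_hub` (the exact filename `mm_projector.bin` within that repository is an assumption; adjust it to whatever the projector file is actually named):

```python
# Minimal sketch: fetch the projector weights from the Hugging Face Hub.
# Assumption: the projector file in liuhaotian/llava-v1.5-7b is named mm_projector.bin.
from huggingface_hub import hf_hub_download

projector_path = hf_hub_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    filename="mm_projector.bin",
)
print(projector_path)  # local cache path of the downloaded file
```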
We use the following datasets:
- Agriculture: Download the PlantDoc dataset from here and use the standard train-test split.
- Textures: Download the DTD dataset from here and use the standard train-test split (`train1.txt` and `test1.txt`).
- Dermatology: Download the DermaNet dataset from here and use the standard train-test split.
- Humanitarian: Download the CrisisMMD dataset from here (version v2.0) and use the standard train-test split.
Prepare the data for fine-tuning the LLaVA models using the script `./prepare_data/format_data_for_finetuning.py`. This outputs a CSV file containing the image paths and labels for the images within the specified directory; the CSV is used for zero-shot inference with the CLIP model. The script also outputs a JSON file that is used for fine-tuning the LLaVA models.
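For orientation, the sketch below illustrates what such a conversion into LLaVA-style fine-tuning JSON looks like. It is not the authors' script; the CSV column names (`image_path`, `label`) and the prompt template are illustrative assumptions:

```python
# Minimal sketch: turn a CSV of (image_path, label) rows into LLaVA-style fine-tuning JSON.
# Assumptions (hypothetical): CSV columns "image_path"/"label" and this prompt template;
# the actual prepare_data/format_data_for_finetuning.py script may differ.
import json
import pandas as pd

df = pd.read_csv("train_split.csv")
records = []
for idx, row in df.iterrows():
    records.append({
        "id": str(idx),
        "image": row["image_path"],
        "conversations": [
            {"from": "human", "value": "<image>\nWhat category does this image belong to?"},
            {"from": "gpt", "value": str(row["label"])},
        ],
    })

with open("finetune_data.json", "w") as f:
    json.dump(records, f, indent=2)
```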
The code for the experiments is available in the `experiments` directory, which contains the following subdirectories:

- `./experiments/clip-zs`: Contains the code for the zero-shot experiments using CLIP. Run the zero-shot experiment with `python zero_shot_inference.py` after specifying the .csv file containing the image paths and labels from the test set; this file is obtained by running the `format_data_for_finetuning.py` script. (A minimal sketch of the general CLIP zero-shot recipe is included after this list.)
- `./experiments/llava-ft`: Contains the experiments for the `llava-v1.5-7b` model. There are two fine-tuning strategies:
  - Fine-tuning the projection layer while keeping the LLM frozen. This corresponds to running the following command:

    ```bash
    bash experiments/llava-ft/scripts/v1_5/pretrain.sh
    ```

    - Modify the relevant paths in the `pretrain.sh` script to point to the correct base model (`llava-v1.5-7b`), the correct `data_path` (i.e., the JSON file obtained above), the image directory, and the output directory (which will store the updated projector). The preset hyper-parameter values work seamlessly with 2 A100 (80 GB) GPUs.
    - Once the `mm_projector.bin` is updated, it is stored in the specified output directory.
    - The updated `mm_projector.bin` can then be merged with your base model (i.e., `llava-v1.5-7b`) using the bash script inside `./experiments/llava-ft/merge_proj/`:

      ```bash
      bash ./merge_proj/update_model.sh <source_model_path> <updated_projector_path> <save_merged_model_path>
      ```

    - Following these operations, run the zero-shot inference with the updated model (stored in `<save_merged_model_path>`) using the `cli.py` script inside `./experiments/llava-ft/llava/serve/`.
  - Fine-tuning the entire model. This corresponds to running the following command:

    ```bash
    bash experiments/llava-ft/scripts/v1_5/finetune_task.sh
    ```

    - Modify the relevant paths in the `finetune_task.sh` script to point to the correct base model (`llava-v1.5-7b`), the correct `data_path` (i.e., the JSON file obtained above), the image directory, and the output directory. The preset hyper-parameter values work seamlessly with 2 A100 (80 GB) GPUs.
    - Once the model is fine-tuned, run the zero-shot inference with the updated model using the `cli.py` script inside `./experiments/llava-ft/llava/serve/`. There is no need to merge the updated model with the base model in this case.
- `./experiments/estimte_richness/`: Contains the code for training MLPs on the pre- and post-projection representations of the images. Adjust the hyper-parameters in `train_mlp.py` and run the script to train the MLPs. (A minimal sketch of such an MLP probe is included after this list.)
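
For reference, here is a minimal sketch of the general CLIP zero-shot classification recipe mentioned for `./experiments/clip-zs`. It is not the authors' `zero_shot_inference.py`; the checkpoint name, the CSV column names (`image_path`, `label`), and the prompt template are illustrative assumptions:

```python
# Minimal sketch of CLIP zero-shot classification over a CSV of (image_path, label) rows.
# Assumptions (hypothetical): CSV columns "image_path"/"label", checkpoint, and prompt template.
import pandas as pd
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

df = pd.read_csv("test_split.csv")  # CSV produced by format_data_for_finetuning.py
labels = sorted(df["label"].astype(str).unique())
prompts = [f"a photo of {label}" for label in labels]

correct = 0
for _, row in df.iterrows():
    image = Image.open(row["image_path"]).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
    pred = labels[logits.argmax(dim=-1).item()]
    correct += int(pred == str(row["label"]))

print(f"zero-shot accuracy: {correct / len(df):.4f}")
```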
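And here is a minimal sketch of the MLP-probing setup used in the richness-estimation experiments. It is not the authors' `train_mlp.py`; the feature/label file names, hidden size, and training schedule are illustrative assumptions:

```python
# Minimal sketch: train an MLP probe on pre- or post-projection image representations.
# Assumptions (hypothetical): features saved as an (N, D) float tensor in features.pt,
# integer class labels in labels.pt; hidden size and schedule are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

features = torch.load("features.pt").float()  # (N, D) pre- or post-projection embeddings
labels = torch.load("labels.pt").long()       # (N,) integer class labels
num_classes = int(labels.max().item()) + 1

mlp = nn.Sequential(
    nn.Linear(features.shape[1], 512),
    nn.ReLU(),
    nn.Linear(512, num_classes),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(mlp(x), y)
        loss.backward()
        optimizer.step()
```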
If you use this codebase, please cite our paper:
```bibtex
@inproceedings{verma2024crossmodalprojection,
  title={Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space},
  author={Verma, Gaurav and Choi, Minje and Sharma, Kartik and Watson-Daniels, Jamelle and Oh, Sejoon and Kumar, Srijan},
  booktitle={62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2024}
}
```
This codebase is built on top of LLaVA's codebase, and we thank the authors for making it publicly available. Relevant citations:
```bibtex
@misc{liu2023improvedllava,
  title={Improved Baselines with Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  publisher={arXiv:2310.03744},
  year={2023}
}

@misc{liu2023llava,
  title={Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher={NeurIPS},
  year={2023}
}
```