Code for the preprint: Schumann et al., "VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View"
Project page: https://map2seq.schumann.pub/vln/velma/
Download CLIP embeddings for images and landmarks from: here or here
Extract all files and move them into the 'features/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k' folder.
Tested with Python 3.10
Install the packages listed in requirements.txt.
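For example, inside a fresh Python 3.10 environment:
pip install -r requirements.txt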
In general, the default arguments in the run_* scripts should be correct.
Inference with GPT-3 / GPT-4:
python run_inference_gpt.py --num_instances -1 --dataset_name touchdown --split dev --exp_name github --image openclip --model_name openai/text-davinci-003
python run_inference_gpt.py --num_instances -1 --dataset_name map2seq --split dev --exp_name github --image openclip --model_name openai/text-davinci-003
python run_inference_gpt.py --num_instances -1 --dataset_name touchdown --split dev --exp_name github --image openclip --model_name openai/gpt-4-0314
python run_inference_gpt.py --num_instances -1 --dataset_name map2seq --split dev --exp_name github --image openclip --model_name openai/gpt-4-0314
All API calls to GPT-3 and GPT-4 using openclip features should be cached. Otherwise, be careful with how many instances you run; the costs can add up quickly. There are up to 40 API calls per instance with ~2000 tokens each!
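As a rough sanity check before launching an uncached run, the numbers above work out to roughly 40 * 2000 = 80,000 prompt tokens per instance. A minimal cost-estimate sketch; the per-1k-token prices below are placeholders, not official figures, so check current OpenAI pricing:

# Back-of-the-envelope API cost estimate based on the numbers above:
# up to 40 calls per instance, ~2000 tokens per call.
CALLS_PER_INSTANCE = 40
TOKENS_PER_CALL = 2000

# Placeholder prices in USD per 1k prompt tokens (assumptions, not official pricing).
PRICE_PER_1K_TOKENS = {
    "openai/text-davinci-003": 0.02,
    "openai/gpt-4-0314": 0.03,
}

def estimate_cost(model_name, num_instances):
    """Upper bound on prompt-token cost for num_instances navigation instances."""
    total_tokens = CALLS_PER_INSTANCE * TOKENS_PER_CALL * num_instances
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model_name]

print(estimate_cost("openai/gpt-4-0314", 100))  # e.g. estimated cost for 100 instances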
Inference with LLaMA, Llama-2 and OPT:
python run_inference_llama.py --num_instances -1 --dataset_name map2seq --split dev --model_name decapoda-research/llama-7b-hf
python run_inference_llama.py --num_instances -1 --dataset_name map2seq --split dev --model_name meta-llama/Llama-2-7b-hf --hf_auth_token {huggingface key}
python run_inference_llama.py --num_instances -1 --dataset_name map2seq --split dev --model_name facebook/opt-1.3b
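The --hf_auth_token flag is only needed for gated checkpoints such as meta-llama/Llama-2-7b-hf. As a rough illustration of what the token is used for, here is a minimal sketch of loading a gated Hugging Face causal LM; the actual loading code lives in run_inference_llama.py and may differ:

# Minimal sketch, not the repository's exact loading code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # gated model: requires an accepted license on the Hub
hf_auth_token = "hf_..."                 # your Hugging Face access token

tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=hf_auth_token)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=hf_auth_token)
model.eval()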
Inference with finetuned models (VELMA-FT and VELMA-RBL):
python run_inference_ft.py --weights_dir weights/VELMA-FT-touchdown/ --dataset_name touchdown --splits dev test
python run_inference_ft.py --weights_dir weights/VELMA-FT-map2seq/ --dataset_name map2seq --splits dev test
python run_inference_ft.py --weights_dir weights/VELMA-RBL-touchdown/ --dataset_name touchdown --splits dev test
python run_inference_ft.py --weights_dir weights/VELMA-RBL-map2seq/ --dataset_name map2seq --splits dev test
Regular finetuning:
python run_finetune.py --exp_name github_openclip_seed1 --dataset_name map2seq --image openclip
python run_finetune.py --exp_name github_openclip_seed1 --dataset_name touchdown --image openclip
Finetuning with response-based learning (RBL):
python run_finetune_rl.py --exp_name github_openclip_seed1 --dataset_name map2seq --image openclip
python run_finetune_rl.py --exp_name github_openclip_seed1 --dataset_name touchdown --image openclip
Code based on https://github.com/VegB/VLN-Transformer and https://github.com/raphael-sch/map2seq_vln
Touchdown splits based on: https://github.com/lil-lab/touchdown
map2seq splits based on: https://map2seq.schumann.pub
Panorama images can be downloaded here: https://sites.google.com/view/streetlearn/dataset
Please cite the following paper if you use this code:
@article{schumann-2023-velma,
    title = "VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View",
    author = "Raphael Schumann and Wanrong Zhu and Weixi Feng and Tsu-Jui Fu and Stefan Riezler and William Yang Wang",
    year = "2023",
    publisher = "arXiv",
    eprint = "2307.06082"
}