Official PyTorch implementation of the paper "HowToCaption: Prompting LLMs to Transform Video Annotations at Scale", ECCV 2024.
by Nina Shvetsova*, Anna Kukleva*, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne.
We release the HowToCaption dataset. Check the dataset readme to download it.
The HowToCaption dataset comprises 1.2M long-term instructional videos from the HowTo100M dataset, whose ASR subtitles have been transformed into proper captions with our HowToCaption method using the Vicuna-13B LLM (v0). The captions are generated automatically, and their high-quality alignment to the video is further ensured through subsequent alignment and filtering post-processing, all without any human involvement. As a result, the HowToCaption dataset contains 25M aligned video-text pairs.
Using the proposed HowToCaption dataset, we pretrained video-language models (initialized from the image-text BLIP model). All checkpoints are available here.
| Method | Model size | Dataset | YouCook2 R1 | YouCook2 R10 | MSRVTT R1 | MSRVTT R10 |
|---|---|---|---|---|---|---|
| Dual Encoder Model | ViT-B | HowTo100M | 12.2 | 39.3 | 30.8 | 61.7 |
| Dual Encoder Model | ViT-B | WebVid2M | 7.3 | 29.0 | 38.5 | 71.9 |
| Dual Encoder Model | ViT-B | HowToCaption | 13.4 | 44.1 | 37.6 | 73.3 |
| Full Model (with re-ranking) | ViT-B | HowToCaption | 18.15 | 50.4 | 44.3 | 76.6 |
| Full Model (with re-ranking) | ViT-L | HowToCaption | 19.9 | 53.2 | 45.2 | 77.8 |
Full Model (ViT-B) fine-tuned for video captioning:
| Dataset | BLEU@4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| YouCook2 | 8.8 | 15.9 | 37.3 | 116.4 |
| MSRVTT | 49.8 | 32.2 | 66.3 | 65.3 |
| MSVD | 70.4 | 46.4 | 83.2 | 154.2 |
We also release weights for the fine-tuned VAST ViT-L model: weights.
conda create python=3.8 -y -n howtocaption
conda activate howtocaption
conda install -y pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
pip install -e .
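Optionally, a quick sanity check that the pinned PyTorch build (the CUDA 11.3 build assumed from the install command above) sees your GPUs:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```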
Preprocess the data into the `data/howto100m` folder:
- Link the folder with the HowTo100M videos to `data/howto100m/videos`.
- Create a CSV file `data/howto100m/video_path_downloaded.csv` with video_id and video_path correspondences (the path should be relative to the folder `data/howto100m/videos`); one way to generate this file is sketched after these data preparation steps. For example:
video_id, video_path
RoupYOneCIo,food_and_entertaining/video_R/RoupYOneCIo.webm
Zx3_yGY_ERs,food_and_entertaining/video_Z/Zx3_yGY_ERs.mp4
Follow the dataset readme to download the HowToCaption dataset and store it in `data/howtocaption`.
Follow the CLIP4CLIP guidelines to download the MSRVTT, MSVD, and LSMDC datasets.
Follow the YouCook2 guidelines to download YouCook2. Put the datasets in the corresponding folders: `data/msrvtt`, `data/msvd`, `data/lsmdc`, `data/youcook2`.
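A minimal sketch for generating `data/howto100m/video_path_downloaded.csv` (from the preprocessing step above) by scanning the linked video folder; it assumes the video files are named `<video_id>.<extension>`, as in the example:

```python
import csv
import os

VIDEO_ROOT = "data/howto100m/videos"
EXTENSIONS = (".mp4", ".webm", ".mkv")  # adjust to the formats you downloaded

rows = []
for dirpath, _, filenames in os.walk(VIDEO_ROOT):
    for name in filenames:
        if not name.lower().endswith(EXTENSIONS):
            continue
        video_id = os.path.splitext(name)[0]  # e.g. RoupYOneCIo
        video_path = os.path.relpath(os.path.join(dirpath, name), VIDEO_ROOT)
        rows.append((video_id, video_path))

with open("data/howto100m/video_path_downloaded.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "video_path"])
    writer.writerows(rows)
```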
This repository uses YAML files to store all hyperparameters. The `configs` folder contains configs for LLM prompting, vision-language model training, and evaluation.
This repository uses Sacred with neptune.ai for logging and tracking experiments. If you want to activate this:
- Create a neptune.ai account (you may request an academic account if applicable).
- Create a project and copy your credentials (api_token, project name) into `train.py`.
- Pass the `--neptune` flag when running `train.py`.
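For example, assuming the credentials are already set in `train.py`, logging can be enabled by appending the flag to any training command:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
    howtocaption/train.py \
    -c configs/VL_training/dual_encoder_retrieval.yaml \
    --distributed 1 \
    --world_size ${WORLD_SIZE} \
    --neptune
```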
Evaluate video retrieval (without re-ranking):
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/eval.py \
--resume pretrained/dual_encoder_retrieval.pth\
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE} \
--eval_retrieval
Evaluate video captioning:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/eval.py \
--resume pretrained/captioning_msrvtt.pth\
-c configs/VL_training/captioning_msrvtt.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE} \
--eval_captioning
See more configs in `configs/` and models here.
For retrieval evaluation with re-ranking, we followed the VAST implementation.
Train Dual-Encoder Model (initialized from BLIP) on HowToCaption Dataset:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE}
Train Full Encoder-Decoder Model (initialized from BLIP) on HowToCaption Dataset:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE}
See more configs in `configs/`.
We share all steps of the HowToCaption framework with an example of applying it to the HowTo100M dataset.
- Make sure `videos` and `video_path_downloaded.csv` are in `data/howto100m` (as described in Data Preparation).
- Prepare the ASR annotations. Download them, filter them by the downloaded videos, and split them into 200-word blocks (the chunking is sketched after the quick-start command below). We used the Sentencified HTM version of the ASR annotations, where the ASR was preprocessed into full sentences.
wget http://www.robots.ox.ac.uk/~htd/tan/sentencified_htm_1200k.json -P data/howto100m
python howtocaption/llm_prompting/scripts/1_filter_available.py --asr data/howto100m/sentencified_htm_1200k.json \
--csv data/howto100m/video_path_downloaded.csv --output_folder data/howto100m/
python howtocaption/llm_prompting/scripts/2_create_word_blocks.py --asr data/howto100m/asr_filtered.pickle \
--n_words_max 200 --output_key '200w'
For a quick start before prompting Vicuna on the full set, you can use `video_path_filtered_s50.pickle`, which contains only 50 videos:
python howtocaption/llm_prompting/scripts/2_create_word_blocks.py --asr data/howto100m/asr_filtered_s50.pickle \
--n_words_max 200 --output_key '200w'
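Conceptually, the chunking performed by `2_create_word_blocks.py` resembles the sketch below (this is not the actual script; it assumes the ASR of one video is given as sentences with start/end timestamps):

```python
def create_word_blocks(sentences, starts, ends, n_words_max=200):
    """Greedily group consecutive ASR sentences into blocks of at most n_words_max words."""
    blocks, current, n_words = [], [], 0
    for sentence, start, end in zip(sentences, starts, ends):
        n = len(sentence.split())
        if current and n_words + n > n_words_max:  # close the current block
            blocks.append({"text": " ".join(s for s, _, _ in current),
                           "start": current[0][1], "end": current[-1][2]})
            current, n_words = [], 0
        current.append((sentence, start, end))
        n_words += n
    if current:  # flush the last block
        blocks.append({"text": " ".join(s for s, _, _ in current),
                       "start": current[0][1], "end": current[-1][2]})
    return blocks
```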
- Download the Vicuna weights. We used Vicuna-13B (v0). To download the LLaMA weights and the Vicuna-13B delta weights and to apply them, follow the official instructions in "How to Apply Delta Weights".
- Prompt the Vicuna model to transform ASRs into captions. Results will be saved in a separate file for each video_id. Tip: run the same script on multiple GPUs to speed up processing (see the sketch after the command below). You may use `asr_filtered_s50.pickle` for a quick start.
python howtocaption/llm_prompting/prompt_vicuna.py --config configs/vicuna/final_prompt.yaml \
--asr-path data/howto100m/asr_filtered.pickle \
--model-path '/BS/nshvetso/work/cache/huggingface/transformers/models--vicuna-13b'
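A sketch of the multi-GPU tip: launch the same command once per GPU, pinning each process with `CUDA_VISIBLE_DEVICES`. This assumes that each run skips video_ids whose result file already exists (otherwise, shard the ASR pickle manually) and that `--model-path` points to your local Vicuna-13B weights (placeholder path below):

```bash
for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=${gpu} python howtocaption/llm_prompting/prompt_vicuna.py \
        --config configs/vicuna/final_prompt.yaml \
        --asr-path data/howto100m/asr_filtered.pickle \
        --model-path /path/to/vicuna-13b &
done
wait
```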
- Collect all Vicuna predictions into a single pickle with input timestamps.
python howtocaption/llm_prompting/scripts/3_collect_predictions.py --config configs/vicuna/final_prompt.yaml \
--asr-path data/howto100m/asr_filtered.pickle \
--output-path output/vicuna/final_prompt.pickle
- Extract embeddings for all frames:
config=blip
python howtocaption/save_frame_embeddings.py \
-c configs/align_and_filter/${config}.yaml
Tip: use `--process_only_part_i` and `--number_of_parts` to process only part of the input data in the current process; you can then launch one process per part, as sketched after the example. For example:
config=blip
process_only_part_i=0
python howtocaption/save_frame_embeddings.py \
-c configs/align_and_filter/${config}.yaml \
--process_only_part_i ${process_only_part_i} \
--number_of_parts 64
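For instance, a sketch that launches one process per part (shown sequentially here; submit the parts as separate jobs or background processes to run them in parallel):

```bash
config=blip
number_of_parts=64
for process_only_part_i in $(seq 0 $((number_of_parts - 1))); do
    python howtocaption/save_frame_embeddings.py \
        -c configs/align_and_filter/${config}.yaml \
        --process_only_part_i ${process_only_part_i} \
        --number_of_parts ${number_of_parts}
done
```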
- Extract embeddings for all generated captions:
config=blip
llm_predictions=final_prompt
python howtocaption/save_text_embeddings.py \
--llm_predictions output/vicuna/${llm_predictions}.pickle \
-c configs/align_and_filter/${config}.yaml
- Run the alignment and filtering (a conceptual sketch follows the command):
config=blip
llm_predictions=final_prompt
python howtocaption/align_and_filter.py \
--frame_embeddings output/embeddings/video_${config}.pickle \
--text_embeddings output/embeddings/text_${config}_${llm_predictions}.pickle \
--top_pairs_threshold 25000000 \
--output output/generated_dataset/${config}_${llm_predictions}.pickle
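Conceptually, this step matches each generated caption to video frames by embedding similarity and keeps only the globally best-scoring pairs. The sketch below illustrates the idea only, not the actual `align_and_filter.py` (which, for example, also accepts several embedding sets, as used in the re-alignment step below); it assumes L2-normalized NumPy embedding matrices:

```python
def align_and_filter(frame_emb, text_emb, top_pairs_threshold):
    """frame_emb: dict video_id -> (num_frames, dim) L2-normalized frame embeddings.
       text_emb:  dict video_id -> (num_captions, dim) L2-normalized caption embeddings.
       Returns the globally best-scoring (score, video_id, caption_idx, frame_idx) pairs."""
    candidates = []
    for vid, captions in text_emb.items():
        sim = captions @ frame_emb[vid].T  # cosine similarity: captions x frames
        best_frame = sim.argmax(axis=1)    # alignment: best frame for each caption
        best_score = sim.max(axis=1)
        for c, (f, s) in enumerate(zip(best_frame, best_score)):
            candidates.append((float(s), vid, c, int(f)))
    candidates.sort(reverse=True)          # filtering: keep only the top-scoring pairs
    return candidates[:top_pairs_threshold]
```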
- Fine-tune the video-language model on the initial alignments (or use our fine-tuned model). Add the path to the aligned captions in the config `configs/align_and_filter/finetune_1round.yaml` and fine-tune the model:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/align_and_filter/finetune_1round.yaml\
--distributed 1 \
--world_size ${WORLD_SIZE}
- Extract text and video features with the new model and re-run the alignment using both the original and the new features (the score combination is sketched after the command):
config1=blip
config2=blip_ft_1round
llm_predictions=final_prompt
python howtocaption/align_and_filter.py \
--frame_embeddings output/embeddings/video_${config1}.pickle output/embeddings/video_${config2}.pickle \
--text_embeddings output/embeddings/text_${config1}_${llm_predictions}.pickle output/embeddings/text_${config2}_${llm_predictions}.pickle \
--top_pairs_threshold 25000000 \
--output output/generated_dataset/average_${llm_predictions}.pickle
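The output name suggests that the similarities from the original and the fine-tuned encoders are combined by averaging; a hedged sketch of that combination step (an assumption, not taken from the repository):

```python
import numpy as np

def combined_similarity(caption_embs, frame_embs):
    """caption_embs, frame_embs: lists of L2-normalized embedding matrices,
       one entry per model (e.g., the original and the fine-tuned encoder).
       Returns the per-model cosine similarities averaged into a single matrix."""
    sims = [c @ f.T for c, f in zip(caption_embs, frame_embs)]
    return np.mean(sims, axis=0)
```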
The main structure of the code is based on https://github.com/victoresque/pytorch-template, which is licensed under MIT.
The code is partly derived from https://github.com/salesforce/BLIP, https://github.com/ArrowLuo/CLIP4Clip, https://github.com/whwu95/Cap4Video, https://github.com/lm-sys/FastChat, and https://github.com/tylin/coco-caption, which are licensed under the Apache License 2.0, MIT, or BSD-3 licenses.
All other code is licensed under MIT. All license clauses are in the LICENSE file.
If you use this code in your research, please cite:
@inproceedings{shvetsova2023howtocaption,
  title={HowToCaption: Prompting LLMs to transform video annotations at scale},
  author={Shvetsova, Nina and Kukleva, Anna and Hong, Xudong and Rupprecht, Christian and Schiele, Bernt and Kuehne, Hilde},
  booktitle={ECCV},
  year={2024}
}