Official PyTorch implementation of the paper "HowToCaption: Prompting LLMs to Transform Video Annotations at Scale", ECCV 2024.
by Nina Shvetsova*, Anna Kukleva*, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne.
We release the HowToCaption dataset. Check the dataset readme to download it.
The HowToCaption dataset comprises 1.2M long-term instructional videos from the HowTo100M dataset, whose ASR subtitles have been transformed into proper captions with our HowToCaption method using the Vicuna-13B LLM (v0). The captions are generated automatically, and their high-quality alignment to the video is further ensured through subsequent alignment and filtering post-processing, all without any human involvement. As a result, the HowToCaption dataset contains 25M aligned video-text pairs.
Using the proposed HowToCaption dataset, we pretrained video-language models (initialized from the image-text BLIP model). All checkpoints are available here.
| Method | Model size | Dataset | YouCook2 R1 | YouCook2 R10 | MSRVTT R1 | MSRVTT R10 |
|---|---|---|---|---|---|---|
| Dual Encoder Model | ViT-B | HowTo100M | 12.2 | 39.3 | 30.8 | 61.7 |
| Dual Encoder Model | ViT-B | WebVid2M | 7.3 | 29.0 | 38.5 | 71.9 |
| Dual Encoder Model | ViT-B | HowToCaption | 13.4 | 44.1 | 37.6 | 73.3 |
| Full Model (with re-ranking) | ViT-B | HowToCaption | 18.15 | 50.4 | 44.3 | 76.6 |
| Full Model (with re-ranking) | ViT-L | HowToCaption | 19.9 | 53.2 | 45.2 | 77.8 |
Full Model (ViT-B) fine-tuned for video captioning:
| Dataset | BLEU@4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|
| YouCook2 | 8.8 | 15.9 | 37.3 | 116.4 |
| MSRVTT | 49.8 | 32.2 | 66.3 | 65.3 |
| MSVD | 70.4 | 46.4 | 83.2 | 154.2 |
We also release weights for the fine-tuned VAST ViT-L model: weights.
conda create python=3.8 -y -n howtocaption
conda activate howtocaption
conda install -y pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
pip install -e .
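Optionally, a quick sanity check that the pinned PyTorch build (the CUDA 11.3 build assumed from the install command above) sees your GPUs:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```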
Preprocess the data into the `data/howto100m` folder:
- Link the folder with the HowTo100M videos to `data/howto100m/videos`.
- Create a CSV file `data/howto100m/video_path_downloaded.csv` with video_id and video_path correspondences (the path should be relative to the folder `data/howto100m/videos`); one way to generate this file is sketched after these data preparation steps. For example:
video_id, video_path
RoupYOneCIo,food_and_entertaining/video_R/RoupYOneCIo.webm
Zx3_yGY_ERs,food_and_entertaining/video_Z/Zx3_yGY_ERs.mp4
Follow the dataset readme to download the HowToCaption dataset and store it in `data/howtocaption`.
Follow the CLIP4CLIP guidelines to download the MSRVTT, MSVD, and LSMDC datasets.
Follow the YouCook2 guidelines to download YouCook2. Put the datasets in the corresponding folders: `data/msrvtt`, `data/msvd`, `data/lsmdc`, `data/youcook2`.
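A minimal sketch for generating `data/howto100m/video_path_downloaded.csv` (from the preprocessing step above) by scanning the linked video folder; it assumes the video files are named `<video_id>.<extension>`, as in the example:

```python
import csv
import os

VIDEO_ROOT = "data/howto100m/videos"
EXTENSIONS = (".mp4", ".webm", ".mkv")  # adjust to the formats you downloaded

rows = []
for dirpath, _, filenames in os.walk(VIDEO_ROOT):
    for name in filenames:
        if not name.lower().endswith(EXTENSIONS):
            continue
        video_id = os.path.splitext(name)[0]  # e.g. RoupYOneCIo
        video_path = os.path.relpath(os.path.join(dirpath, name), VIDEO_ROOT)
        rows.append((video_id, video_path))

with open("data/howto100m/video_path_downloaded.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "video_path"])
    writer.writerows(rows)
```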
This repository uses YAML files to store all hyperparameters. The `configs` folder contains configs for LLM prompting, vision-language model training, and evaluation.
This repository uses Sacred with neptune.ai for logging and tracking experiments. If you want to activate this:
- Create a neptune.ai account (you may request an academic account if applicable).
- Create a project and copy your credentials (api_token, project name) into `train.py`.
- Pass the `--neptune` flag when running `train.py`.
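For example, assuming the credentials are already set in `train.py`, logging can be enabled by appending the flag to any training command:

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
    howtocaption/train.py \
    -c configs/VL_training/dual_encoder_retrieval.yaml \
    --distributed 1 \
    --world_size ${WORLD_SIZE} \
    --neptune
```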
Evaluate video retrieval (without re-ranking):
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/eval.py \
--resume pretrained/dual_encoder_retrieval.pth\
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE} \
--eval_retrieval
Evaluate video captioning:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/eval.py \
--resume pretrained/captioning_msrvtt.pth\
-c configs/VL_training/captioning_msrvtt.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE} \
--eval_captioning
See more configs in `configs/` and models here.
For retrieval evaluation with re-ranking, we followed the VAST implementation.
Train Dual-Encoder Model (initialized from BLIP) on HowToCaption Dataset:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE}
Train Full Encoder-Decoder Model (initialized from BLIP) on HowToCaption Dataset:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/VL_training/dual_encoder_retrieval.yaml \
--distributed 1 \
--world_size ${WORLD_SIZE}
See more configs in `configs/`.
We share all steps of the HowToCaption framework with an example of applying it to the HowTo100M dataset.
- Make sure `videos` and `video_path_downloaded.csv` are in `data/howto100m` (as described in Data Preparation).
- Prepare the ASR annotations. Download them, filter them by the downloaded videos, and split them into 200-word blocks (the chunking is sketched after the quick-start command below). We used the Sentencified HTM version of the ASR annotations, where the ASR was preprocessed into full sentences.
wget http://www.robots.ox.ac.uk/~htd/tan/sentencified_htm_1200k.json -P data/howto100m
python howtocaption/llm_prompting/scripts/1_filter_available.py --asr data/howto100m/sentencified_htm_1200k.json \
--csv data/howto100m/video_path_downloaded.csv --output_folder data/howto100m/
python howtocaption/llm_prompting/scripts/2_create_word_blocks.py --asr data/howto100m/asr_filtered.pickle \
--n_words_max 200 --output_key '200w'
For a quick start before prompting Vicuna on the full set, you can use `video_path_filtered_s50.pickle`, which contains only 50 videos:
python howtocaption/llm_prompting/scripts/2_create_word_blocks.py --asr data/howto100m/asr_filtered_s50.pickle \
--n_words_max 200 --output_key '200w'
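Conceptually, the chunking performed by `2_create_word_blocks.py` resembles the sketch below (this is not the actual script; it assumes the ASR of one video is given as sentences with start/end timestamps):

```python
def create_word_blocks(sentences, starts, ends, n_words_max=200):
    """Greedily group consecutive ASR sentences into blocks of at most n_words_max words."""
    blocks, current, n_words = [], [], 0
    for sentence, start, end in zip(sentences, starts, ends):
        n = len(sentence.split())
        if current and n_words + n > n_words_max:  # close the current block
            blocks.append({"text": " ".join(s for s, _, _ in current),
                           "start": current[0][1], "end": current[-1][2]})
            current, n_words = [], 0
        current.append((sentence, start, end))
        n_words += n
    if current:  # flush the last block
        blocks.append({"text": " ".join(s for s, _, _ in current),
                       "start": current[0][1], "end": current[-1][2]})
    return blocks
```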
- Download the Vicuna weights. We used Vicuna-13B (v0). To download the LLaMA weights and the Vicuna-13B delta weights and to apply them, follow the official instructions in "How to Apply Delta Weights".
- Prompt the Vicuna model to transform ASRs into captions. Results will be saved in a separate file for each video_id. Tip: run the same script on multiple GPUs to speed up processing (see the sketch after the command below). You may use `asr_filtered_s50.pickle` for a quick start.
python howtocaption/llm_prompting/prompt_vicuna.py --config configs/vicuna/final_prompt.yaml \
--asr-path data/howto100m/asr_filtered.pickle \
--model-path '/BS/nshvetso/work/cache/huggingface/transformers/models--vicuna-13b'
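A sketch of the multi-GPU tip: launch the same command once per GPU, pinning each process with `CUDA_VISIBLE_DEVICES`. This assumes that each run skips video_ids whose result file already exists (otherwise, shard the ASR pickle manually) and that `--model-path` points to your local Vicuna-13B weights (placeholder path below):

```bash
for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=${gpu} python howtocaption/llm_prompting/prompt_vicuna.py \
        --config configs/vicuna/final_prompt.yaml \
        --asr-path data/howto100m/asr_filtered.pickle \
        --model-path /path/to/vicuna-13b &
done
wait
```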
- Collect all Vicuna predictions into a single pickle with input timestamps.
python howtocaption/llm_prompting/scripts/3_collect_predictions.py --config configs/vicuna/final_prompt.yaml \
--asr-path data/howto100m/asr_filtered.pickle \
--output-path output/vicuna/final_prompt.pickle
- Extract embeddings for all frames:
config=blip
python howtocaption/save_frame_embeddings.py \
-c configs/align_and_filter/${config}.yaml
Tip: use `--process_only_part_i` and `--number_of_parts` to process only part of the input data in the current process; you can then launch one process per part, as sketched after the example. For example:
config=blip
process_only_part_i=0
python howtocaption/save_frame_embeddings.py \
-c configs/align_and_filter/${config}.yaml \
--process_only_part_i ${process_only_part_i} \
--number_of_parts 64
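For instance, a sketch that launches one process per part (shown sequentially here; submit the parts as separate jobs or background processes to run them in parallel):

```bash
config=blip
number_of_parts=64
for process_only_part_i in $(seq 0 $((number_of_parts - 1))); do
    python howtocaption/save_frame_embeddings.py \
        -c configs/align_and_filter/${config}.yaml \
        --process_only_part_i ${process_only_part_i} \
        --number_of_parts ${number_of_parts}
done
```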
- Extract embeddings for all generated captions:
config=blip
llm_predictions=final_prompt
python howtocaption/save_text_embeddings.py \
--llm_predictions output/vicuna/${llm_predictions}.pickle \
-c configs/align_and_filter/${config}.yaml
- Run the alignment and filtering (a conceptual sketch follows the command):
config=blip
llm_predictions=final_prompt
python howtocaption/align_and_filter.py \
--frame_embeddings output/embeddings/video_${config}.pickle \
--text_embeddings output/embeddings/text_${config}_${llm_predictions}.pickle \
--top_pairs_threshold 25000000 \
--output output/generated_dataset/${config}_${llm_predictions}.pickle
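Conceptually, this step matches each generated caption to video frames by embedding similarity and keeps only the globally best-scoring pairs. The sketch below illustrates the idea only, not the actual `align_and_filter.py` (which, for example, also accepts several embedding sets, as used in the re-alignment step below); it assumes L2-normalized NumPy embedding matrices:

```python
def align_and_filter(frame_emb, text_emb, top_pairs_threshold):
    """frame_emb: dict video_id -> (num_frames, dim) L2-normalized frame embeddings.
       text_emb:  dict video_id -> (num_captions, dim) L2-normalized caption embeddings.
       Returns the globally best-scoring (score, video_id, caption_idx, frame_idx) pairs."""
    candidates = []
    for vid, captions in text_emb.items():
        sim = captions @ frame_emb[vid].T  # cosine similarity: captions x frames
        best_frame = sim.argmax(axis=1)    # alignment: best frame for each caption
        best_score = sim.max(axis=1)
        for c, (f, s) in enumerate(zip(best_frame, best_score)):
            candidates.append((float(s), vid, c, int(f)))
    candidates.sort(reverse=True)          # filtering: keep only the top-scoring pairs
    return candidates[:top_pairs_threshold]
```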
- Fine-tune the video-language model on the initial alignments (or use our fine-tuned model). Add the path to the aligned captions in the config `configs/align_and_filter/finetune_1round.yaml` and fine-tune the model:
export MASTER_PORT=12345
export WORLD_SIZE=4
export MASTER_ADDR=localhost
torchrun --standalone --nnodes=1 --nproc_per_node=${WORLD_SIZE} \
howtocaption/train.py \
-c configs/align_and_filter/finetune_1round.yaml\
--distributed 1 \
--world_size ${WORLD_SIZE}
- Extract text and video features with the new model and re-run the alignment using both the original and the new features (the score combination is sketched after the command):
config1=blip
config2=blip_ft_1round
llm_predictions=final_prompt
python howtocaption/align_and_filter.py \
--frame_embeddings output/embeddings/video_${config1}.pickle output/embeddings/video_${config2}.pickle \
--text_embeddings output/embeddings/text_${config1}_${llm_predictions}.pickle output/embeddings/text_${config2}_${llm_predictions}.pickle \
--top_pairs_threshold 25000000 \
--output output/generated_dataset/average_${llm_predictions}.pickle
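The output name suggests that the similarities from the original and the fine-tuned encoders are combined by averaging; a hedged sketch of that combination step (an assumption, not taken from the repository):

```python
import numpy as np

def combined_similarity(caption_embs, frame_embs):
    """caption_embs, frame_embs: lists of L2-normalized embedding matrices,
       one entry per model (e.g., the original and the fine-tuned encoder).
       Returns the per-model cosine similarities averaged into a single matrix."""
    sims = [c @ f.T for c, f in zip(caption_embs, frame_embs)]
    return np.mean(sims, axis=0)
```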
The main structure of the code is based on https://github.com/victoresque/pytorch-template, which is licensed under MIT.
The code is partly derived from https://github.com/salesforce/BLIP, https://github.com/ArrowLuo/CLIP4Clip, https://github.com/whwu95/Cap4Video, https://github.com/lm-sys/FastChat, and https://github.com/tylin/coco-caption, which are licensed under the Apache License 2.0, MIT, or BSD-3 licenses.
All other code is licensed under MIT. All license clauses are in the LICENSE file.
If you use this code in your research, please cite:
@inproceedings{shvetsova2023howtocaption,
  title={HowToCaption: Prompting LLMs to transform video annotations at scale},
  author={Shvetsova, Nina and Kukleva, Anna and Hong, Xudong and Rupprecht, Christian and Schiele, Bernt and Kuehne, Hilde},
  booktitle={ECCV},
  year={2024}
}