DeepCaption is a framework for image captioning research using deep learning. The code is based on the image captioning tutorial by yunjey but has been extensively expanded since then.
The goal of image captioning is to convert a given input image into a natural language description. The encoder-decoder framework is widely used for this task. The image encoder is a convolutional neural network (CNN). The baseline code uses a ResNet-152 model pretrained on the ILSVRC-2012-CLS image classification dataset. The decoder is a long short-term memory (LSTM) recurrent neural network.
DeepCaption supports many features, including:
- external pre-calculated features stored in numpy, lmdb or PicSOM bin format
- persistent features (features input at each RNN iteration)
- soft attention
- teacher forcing scheduling
Some of the advanced features are documented on the separate features documentation page.
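As an illustration of teacher forcing scheduling, the sketch below decays the probability of feeding the ground-truth word to the decoder as training progresses. The function names, the linear schedule and its parameters are assumptions made for this example and do not describe DeepCaption's actual options.

```python
import random


def teacher_forcing_prob(epoch, num_epochs, start=1.0, end=0.0):
    """Linearly decay the teacher forcing probability from start to end.

    Hypothetical helper: the linear schedule is an assumption of this sketch.
    """
    if num_epochs <= 1:
        return start
    frac = min(max(epoch / (num_epochs - 1), 0.0), 1.0)
    return start + (end - start) * frac


def choose_next_input(ground_truth_word, predicted_word, prob, rng=random):
    """Feed the ground-truth word with probability prob, else the model's own prediction."""
    return ground_truth_word if rng.random() < prob else predicted_word


for epoch in range(0, 11, 5):
    print(epoch, teacher_forcing_prob(epoch, num_epochs=11))
```

With a probability of 1.0 the decoder always sees the ground truth (pure teacher forcing). Lower values expose it more often to its own predictions.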
To get the latest release:
git clone https://github.com/aalto-cbir/DeepCaption
or to get the internal development version:
git clone https://version.aalto.fi/gitlab/CBIR/DeepCaption.git
For example, if you have downloaded the COCO dataset, you might have the images under /path/to/coco/images and the annotations under /path/to/coco/annotations.
First we resize the images to 256x256. This is just to speed up the training process.
./resize.py --image_dir /path/to/coco/images/train2014 --output_dir /path/to/coco/images/train2014_256x256
./resize.py --image_dir /path/to/coco/images/val2014 --output_dir /path/to/coco/images/val2014_256x256
Next, we need to set up the dataset configuration. Create a file datasets/datasets.conf with the following contents:
[coco]
dataset_class = CocoDataset
root_dir = /path/to/coco
[coco:train2014]
image_dir = images/train2014_256x256
caption_path = annotations/captions_train2014.json
[coco:val2014]
image_dir = images/val2014_256x256
caption_path = annotations/captions_val2014.json
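A configuration in this format can be read with Python's standard configparser module. The sketch below assumes that a subsection such as coco:train2014 inherits the keys of its parent section coco, and that relative paths are resolved against root_dir. The helper name resolve_dataset and this inheritance logic are illustrative assumptions, not DeepCaption's actual loader.

```python
import configparser
import os

EXAMPLE_CONF = """
[coco]
dataset_class = CocoDataset
root_dir = /path/to/coco

[coco:train2014]
image_dir = images/train2014_256x256
caption_path = annotations/captions_train2014.json
"""


def resolve_dataset(config, name):
    """Merge a 'parent:subset' section with its parent and join paths with root_dir.

    Hypothetical helper: the inheritance rule is an assumption of this sketch.
    """
    parent = name.split(":")[0]
    merged = {}
    if config.has_section(parent):
        merged.update(config[parent])
    if name != parent:
        merged.update(config[name])  # subset keys override parent keys
    root = merged.get("root_dir", "")
    for key in ("image_dir", "caption_path"):
        if key in merged:
            merged[key] = os.path.join(root, merged[key])
    return merged


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONF)
params = resolve_dataset(config, "coco:train2014")
print(params["dataset_class"], params["image_dir"])
```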
Now we can build the vocabulary:
./build_vocab.py --dataset coco:train2014 --vocab_output_path vocab.pkl
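Conceptually, building a vocabulary means counting the words in the training captions and assigning an index to each sufficiently frequent word. The following is a minimal sketch of that idea. The special tokens, the whitespace tokenizer and the threshold parameter are assumptions for illustration and may differ from what build_vocab.py does.

```python
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]  # assumed token names


def build_vocab(captions, threshold=2):
    """Map each word seen at least `threshold` times to an integer index."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    words = sorted(w for w, c in counts.items() if c >= threshold)
    return {word: idx for idx, word in enumerate(SPECIAL_TOKENS + words)}


def encode(caption, vocab):
    """Convert a caption to indices, using <unk> for out-of-vocabulary words."""
    unk = vocab["<unk>"]
    ids = [vocab.get(w, unk) for w in caption.lower().split()]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]


captions = ["a dog runs", "a cat sleeps", "a dog sleeps"]
vocab = build_vocab(captions)
print(encode("a dog jumps", vocab))
```

Words below the threshold ("runs", "cat") and unseen words ("jumps") all map to the same unknown-word index, which keeps the decoder's output layer small.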
Example of training a single model with default parameters on the COCO dataset:
./train.py --dataset coco:train2014 --vocab vocab.pkl --model_name mymodel
or if you wish to follow validation set metrics:
./train.py --dataset coco:train2014 --vocab vocab.pkl --model_name mymodel --validate coco:val2014 --validation_scoring cider
You can plot the training and validation loss and other statistics using the following command:
./plot_stats.py models/mymodel/train_stats.json
By adding --watch you can have it update the plot automatically every time there are new numbers (typically after each epoch).
Now you can use your model to generate a caption for any random image:
./infer.py --model models/mymodel/ep5.model --print_results random_image.jpg
or a directory of any random images:
./infer.py --model models/mymodel/ep5.model --print_results --image_dir random_image_dir/
You can also do inference on any configured dataset:
./infer.py --model models/mymodel/ep5.model --dataset coco:val2014
You can add e.g. --scoring cider to automatically calculate scoring metrics if a ground truth has been defined for that dataset.
Inference also supports the following flags:
- --max_seq_length: maximum length of decoded caption (in words)
- --no_repeat_sentences: remove repeating sentences if they occur immediately after each other
- --only_complete_senteces: remove the last sentence if it does not end with a period (and thus is likely to be truncated)
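The effect of the last two flags can be illustrated with a small post-processing sketch. The function below is a hypothetical re-implementation for illustration only, assuming sentences are separated by periods; it is not the code used by infer.py. As a design choice of this sketch, a caption consisting of a single unfinished sentence is kept rather than returned empty.

```python
def postprocess_caption(caption, no_repeat_sentences=True, only_complete_sentences=True):
    """Drop immediately repeated sentences and a truncated final sentence."""
    text = caption.strip()
    last_is_complete = text.endswith(".")
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    # Drop a trailing fragment, but keep it if it is the only sentence.
    if only_complete_sentences and not last_is_complete and len(sentences) > 1:
        sentences.pop()
        last_is_complete = True

    if no_repeat_sentences:
        deduped = []
        for sentence in sentences:
            # Only compare with the directly preceding sentence.
            if not deduped or sentence != deduped[-1]:
                deduped.append(sentence)
        sentences = deduped

    result = ". ".join(sentences)
    return result + "." if sentences and last_is_complete else result


print(postprocess_caption("a man rides a horse. a man rides a horse. he is wearing a"))
```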
We are trying to maintain a standard project structure. Please refer to this template for future development.
If the self-critical loss is going to be used, the CIDEr-D n-grams need to be precomputed in order to speed up training. Please see scripts/preprocess_ngrams.py. It needs a dataset, a preprocessed vocabulary and a name for the output file of precomputed n-grams.
A usage example:
python scripts/preprocess_ngrams.py --dataset picsom:COCO:train2014+picsom:tgif:imageset --vocab ../vocab-coco.pkl --output ngrams_precomputed.pkl
This is then passed to train.py as:
train.py --... --self_critical_loss sc --validation_scoring ciderd --cached_words ngrams_precomputed.pkl
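To give an idea of what an n-gram precomputation step might cache, the sketch below counts for each n-gram (up to length 4) the number of images whose reference captions contain it. CIDEr-style metrics use such document frequencies to down-weight common n-grams. The function names and the whitespace tokenizer are assumptions for this example; scripts/preprocess_ngrams.py may store different or additional data.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=4):
    """Return the set of all n-grams of length 1..max_n in a token list."""
    ngrams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return ngrams


def document_frequencies(references_per_image, max_n=4):
    """Count, for each n-gram, the number of images whose references contain it."""
    doc_freq = Counter()
    for references in references_per_image:
        seen = set()
        for caption in references:
            seen |= extract_ngrams(caption.lower().split(), max_n)
        doc_freq.update(seen)  # each n-gram counts once per image
    return doc_freq


# Toy data: two images, the first with two reference captions.
refs = [["a dog runs", "the dog runs fast"], ["a cat sleeps"]]
doc_freq = document_frequencies(refs)
print(doc_freq[("a",)], doc_freq[("dog", "runs")])
```

Computing these counts once and loading them at training time avoids recounting the reference n-grams every time a sampled caption is scored.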