PyTorch re-implementation of some image captioning models.
- `show_tell`: Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, et al. CVPR 2015. [Paper] [Code]
- `att2all`: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, et al. ICML 2015. [Paper] [Code]
- `adaptive_att` & `spatial_att`: Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. Jiasen Lu, et al. CVPR 2017. [Paper] [Code]
You can train different models by editing the `caption_model` item in `config.py`.
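For example (a hypothetical excerpt; the exact layout of `config.py` may differ):

```python
# Which model to train; the valid names follow the list above.
caption_model = 'adaptive_att'  # 'show_tell' / 'att2all' / 'spatial_att' / 'adaptive_att'
```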
First, make sure the following are installed in your environment:

- Python >= 3.5
- Java 1.8.0 (required for computing METEOR)
Then install requirements:
pip install -r requirements.txt
For the dataset, I use Flickr30k and Karpathy's split. It is also okay to use Flickr8k or MSCOCO 2014 (their splits and captions are also contained in Karpathy's split). If you want to use other datasets, you may have to create a JSON file that looks like Karpathy's JSON.
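For reference, Karpathy's JSON has roughly the following shape (abridged here; the example values are made up):

```json
{
  "dataset": "flickr30k",
  "images": [
    {
      "filename": "1000092795.jpg",
      "split": "train",
      "sentences": [
        {
          "raw": "Two young men are playing outside.",
          "tokens": ["two", "young", "men", "are", "playing", "outside"]
        }
      ]
    }
  ]
}
```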
Edit options and hyperparameters in `config.py`. Refer to this file for more information about each item.
First of all, you should preprocess the images along with their captions and store them locally:
python preprocess.py
If you would like to use pre-trained word embeddings (like GloVe), just set `embed_pretrain` to `True` and specify the path to the pre-trained vectors (`embed_path`) in `config.py`. You can also choose whether to fine-tune the word embeddings by editing the `fine_tune_embeddings` item.

Or, if you want to randomly initialize the embedding layer's weights, set `embed_pretrain` to `False` and specify the embedding size (`embed_dim`).
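A hypothetical excerpt of `config.py` illustrating both setups (the item names come from the description above; paths and values are placeholders):

```python
# Option 1: use pre-trained word embeddings (e.g. GloVe)
embed_pretrain = True
embed_path = 'data/glove.6B.300d.txt'  # placeholder path to the pre-trained vectors
fine_tune_embeddings = True            # whether to keep updating the embedding layer

# Option 2: randomly initialize the embedding layer
# embed_pretrain = False
# embed_dim = 512                      # embedding size to use in this case
```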
To train a model, just run:
python train.py
If you have enabled TensorBoard (`tensorboard = True` in `config.py`), you can visualize the losses and accuracies during training by:
tensorboard --logdir=<your_log_dir>
To test a checkpoint on the test set and compute evaluation metrics:
python test.py
BLEU, CIDEr, METEOR and ROUGE-L are currently supported. Implementations of these metrics are under the `metrics` folder.
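Since they are adapted from coco-caption, the metrics presumably keep its `compute_score(gts, res)` interface. A rough usage sketch (the import path inside `metrics` is a guess):

```python
from metrics.bleu.bleu import Bleu  # assumed module path

# Both dicts map an image id to a list of caption strings (space-separated tokens).
gts = {'img_1': ['a dog runs across the grass', 'a brown dog is running outside']}  # references
res = {'img_1': ['a dog is running on the grass']}                                  # hypothesis

score, _ = Bleu(4).compute_score(gts, res)
print(score)  # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
```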
During the training stage, BLEU-4 and CIDEr scores on the validation set are computed after each epoch's validation. However, since the decoder's input at each timestep is the ground-truth word rather than the word it generated at the previous timestep (teacher forcing), these scores do not reflect the real performance. So you may also want to use this script to compute the correct scores for a specific trained model on the validation set.
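To make the difference concrete, here is a minimal sketch contrasting the two decoding modes (the `decoder.step()` API is made up for illustration and is not the one in this repo):

```python
def decode_teacher_forcing(decoder, feats, gt_caption, hidden):
    """Validation during training: the input at step t is the ground-truth word."""
    preds = []
    for t in range(gt_caption.size(0) - 1):
        logits, hidden = decoder.step(gt_caption[t], feats, hidden)  # hypothetical step()
        preds.append(logits.argmax(dim=-1))  # predictions are scored, but the inputs were "perfect"
    return preds

def decode_free_running(decoder, feats, start_token, hidden, max_len=50):
    """Real inference: the input at step t is the word predicted at step t - 1."""
    word, preds = start_token, []
    for _ in range(max_len):
        logits, hidden = decoder.step(word, feats, hidden)  # hypothetical step()
        word = logits.argmax(dim=-1)  # feed the model's own prediction back in
        preds.append(word)
    return preds
```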
To generate a caption (and visualize the attention weights if the model uses an attention module) for a specific image:
First, edit the following items in `inference.py`:
model_path = 'path_to_trained_model'
wordmap_path = 'path_to_word_map'
img = 'path_to_image'
beam_size = 5 # beam size for beam search
Then run:
python inference.py
- The `load_embeddings` method (in `utils/embedding.py`) will try to create a cache for the loaded embeddings under the folder `dataset_output_path`. This dramatically speeds up loading the next time.
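The caching idea is roughly the following (a simplified sketch, not the actual code in `utils/embedding.py`):

```python
import os
import torch

def load_embeddings_cached(embed_path, word_map, cache_dir):
    """Load pre-trained vectors for the words in word_map, caching the result."""
    cache_file = os.path.join(cache_dir, 'embeddings_cache.pt')
    if os.path.isfile(cache_file):
        return torch.load(cache_file)  # reuse the cached tensor on later runs

    vectors, embed_dim = {}, None
    with open(embed_path, encoding='utf-8') as f:  # e.g. a GloVe .txt file
        for line in f:
            word, *values = line.rstrip().split(' ')
            if word in word_map:  # keep only words that are in the vocabulary
                vectors[word] = torch.tensor([float(v) for v in values])
                embed_dim = len(values)

    # Words without a pre-trained vector get small random embeddings.
    embeddings = torch.randn(len(word_map), embed_dim) * 0.1
    for word, idx in word_map.items():
        if word in vectors:
            embeddings[idx] = vectors[word]

    torch.save(embeddings, cache_file)  # write the cache for next time
    return embeddings
```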
Here are some examples of captions generated on images in the test set.
I haven't fine-tuned the CNN. You'd probably want to try fine-tuning it to get better results.
Errors: two boys, not a chair...
Error: not crying...
Errors: not a woman, and seems to recognize sleeves as jeans...
- This project is based on sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.
- Implementations of evaluation metrics are adapted from ruotianluo/coco-caption.