A Pythonic Implementation of Image Captioning, C[aA]RNet!

Convolutional(and|Attention)RecurrentNet! The goal of the project is to define a neural model that retrieves a caption for a given image.

Given a dataset, a neural net composed of:

  • Encoder (a pre-trained Residual Neural Network)
  • Decoder (an LSTM model)

will represent the image in a space defined by the encoder; this representation is fed to the decoder (in different ways), which learns to output a caption contextualized by the image and by the words generated at each timestep by the LSTM.

Here you can find the table of contents. Before you start, remember:

You can close the project, and the story ends. You can read the project, you stay in Wonderland, and I show you how deep the rabbit hole goes.

All the pretrained versions of C[aA]RNet can be found under the hidden directory .saved.

Table of contents

Prerequisite Knowledge

To better understand the code and the information inside, and since this repository aims to be understandable for every curious reader and not only for people already involved in this topic, it may be useful to take a look at these references:

  • PyTorch documentation
  • Convolutional Neural Net (Stanford Edu)
  • Recurrent Neural Net (Stanford Edu)
  • Residual Neural Net (D2L AI)

How to run the code

Linux macOS Windows

The code can be run on every OS, so feel free to use whichever you want. Of course a high-end machine is mandatory, since the huge amount of data needed can lead to out-of-memory errors on a low-end machine. Remember that the dataset needs to be downloaded before launching, and it must respect the dataset format.

How to prepare the dataset for C[aA]RNet training?

  1. Download the dataset.
  2. Extract it in the root of the repository.
  3. Rename the folder to dataset
  4. Rename the images folder to images

If you have a special situation, you can modify the VARIABLE.py file and/or some optional parameters before launching the script (see CLI Explanation).

Python supported versions

The code is ready to run on every version of Python greater than 3.6. As you will also see in the code, some facilities are not available in Python versions lower than 3.9. All these tricky situations are marked in the code with a comment, so you can choose what you prefer by commenting/uncommenting them.

Libraries Dependency

| Library | Version |
|---|---|
| Torch | 1.3.0+cu100 |
| Torchvision | 0.4.1+cu100 |
| Pillow | 8.4.0 |
| Numpy | 1.19.5 |
| Pandas | 1.1.5 |
| Matplotlib | 3.3.4 |

Naturally, a requirements.txt file is present in the root of the package; you can install all the required packages in your environment (or virtual environment) with the command below, executed in a shell with the environment activated:

pip install -r requirements.txt

If pip doesn't find these versions of torch and torchvision, you can execute the following in the shell with the venv activated:

pip install torch==1.3.0+cu100 torchvision==0.4.1+cu100 -f https://download.pytorch.org/whl/torch_stable.html

Environment Variables

Since some attributes of the repository are useful in more than one file, creating an environment container is a way to satisfy this need. Using a .env file is the most straightforward method, but since we want full compatibility among OSes, a VARIABLE.py file is a good compromise. The constants defined are the following:

| Constant | Meaning |
|---|---|
| MAX_CAPTION_LENGTH | Only samples whose caption length is less than or equal to this value are picked from the dataset |
| IMAGES_SUBDIRECTORY_NAME | The name of the directory that contains all the images (it must be under the root of the dataset folder) |
| CAPTION_FILE_NAME | The file, under the root of the dataset folder, that contains all the captions |
| EMBEDDINGS_REPRESENTATION | The way of creating the word embeddings. UNUSED FOR NOW |
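As an illustration only, a VARIABLE.py holding these constants might look like the sketch below; the actual values (and the accepted EMBEDDINGS_REPRESENTATION options) are assumptions, so check the file in the repository for the real ones.

```python
# VARIABLE.py - illustrative sketch only; the values below are assumptions,
# check the actual file in the repository for the real ones.

# Only samples whose caption length is <= this value are picked from the dataset.
MAX_CAPTION_LENGTH = 30

# Name of the directory (under the dataset root) containing all the images.
IMAGES_SUBDIRECTORY_NAME = "images"

# Name of the file (under the dataset root) containing all the captions.
CAPTION_FILE_NAME = "results.csv"

# How the word embeddings are created. Unused for now.
EMBEDDINGS_REPRESENTATION = "learned"
```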

CLI Explanation

The code can be run through a shell. Here you can find how to execute the script correctly, which custom parameters you can feed before the execution, and the meaning of each of them. First of all, the always-present part is the invocation of the interpreter and the main file:

python main.py

Afterwards, the usage helper is shown and you can see something like this:

usage: main.py [-h] [--attention ATTENTION]
               [--attention_dim ATTENTION_DIM]
               [--dataset_folder DATASET_FOLDER]
               [--image_path IMAGE_PATH]
               [--splits SPLITS [SPLITS ...]]
               [--batch_size BATCH_SIZE]
               [--epochs EPOCHS] [--lr LR]
               [--workers WORKERS]
               [--device DEVICE]
               {RNetvI,RNetvH,RNetvHC,RNetvHCAttention}
               {train,eval} encoder_dim hidden_dim

The mandatory part is composed of these parameters:

| Parameter | Meaning | Particular behavior |
|---|---|---|
| decoder | The decoder you want to use for the decoding part of the model. Options: {RNetvI, RNetvH, RNetvHC, RNetvHCAttention} | A description of each type of decoder can be found in the next chapters |
| mode | The working mode of the net. Options: {train, eval} (train for training mode, eval for evaluation mode) | |
| encoder_dim | The dimension of the image projection of the encoder | Decoder=RNetvI => don't care / Decoder=RNetvHCAttention => 2048 |
| hidden_dim | The capacity of the LSTM | |

Optional parameters:

| Parameter | Meaning | Default | Note |
|---|---|---|---|
| --attention | Use the attention model | False | If enabled, decoder and encoder are forced to be CResNet50Attention and RNetvHCAttention |
| --attention_dim | Capacity of the attention unit | 1024 | |
| --dataset_folder | The folder containing all the samples | "./dataset" | Used only in training mode |
| --image_path | The absolute path of the image whose caption we want to retrieve | '' | Used only in evaluation mode |
| --splits | Fractions of data to be used for the train, validation and test sets | 60 30 10 | Used only in training mode |
| --batch_size | Mini-batch size | 32 | Used only in training mode |
| --epochs | Number of training epochs | 500 | Used only in training mode |
| --lr | Learning rate (Adam) | 1e-3 | Used only in training mode |
| --workers | Number of working units used to load the data | 4 | Used only in training mode |
| --device | Device to be used for computations, in {cpu, cuda:0, cuda:1, ...} | cpu | Used only in training mode |
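The usage string above is consistent with an argparse-based CLI. The sketch below shows how such a parser could be wired; it is an assumption for illustration only, and the real definitions live in main.py.

```python
# Illustrative sketch of a CLI matching the usage string above (assumed, not the actual main.py).
import argparse

parser = argparse.ArgumentParser(description="C[aA]RNet image captioning")
parser.add_argument("decoder", choices=["RNetvI", "RNetvH", "RNetvHC", "RNetvHCAttention"])
parser.add_argument("mode", choices=["train", "eval"])
parser.add_argument("encoder_dim", type=int, help="Dimension of the image projection")
parser.add_argument("hidden_dim", type=int, help="Capacity of the LSTM")
parser.add_argument("--attention", default=False, help="Use the attention model (e.g. pass 't' to enable)")
parser.add_argument("--attention_dim", type=int, default=1024)
parser.add_argument("--dataset_folder", default="./dataset")
parser.add_argument("--image_path", default="")
parser.add_argument("--splits", nargs="+", type=int, default=[60, 30, 10])
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--epochs", type=int, default=500)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--workers", type=int, default=4)
parser.add_argument("--device", default="cpu")

args = parser.parse_args()
print(args.decoder, args.mode)
```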

Examples

The following examples are the commands that I used for my personal experiments.

Training

CaRNetvI

python main.py RNetvI train 0 1024 --dataset_folder ./dataset --device cuda:0 --epochs 150

CaRNetvH

python main.py RNetvH train 1024 1024 --dataset_folder ./dataset --device cuda:0 --epochs 150

CaRNetvHC

python main.py RNetvHC train 1024 1024 --dataset_folder ./dataset --device cuda:0 --epochs 150

CaRNetvHCAttention

python main.py RNetvHCAttention train 1024 1024 --dataset_folder ./dataset --device cuda:0 --epochs 150 --attention t --attention_dim 1024

Evaluation

CaRNetvI

python main.py RNetvI eval 5078 1024 --image_path ./33465647.jpg

CaRNetvH

python main.py RNetvH eval 1024 1024 --image_path ./33465647.jpg

CaRNetvHC

python main.py RNetvHC eval 1024 1024 --image_path ./33465647.jpg

CaRNetvHCAttention

python main.py RNetvHCAttention  eval 1024 1024 --attention t --attention_dim 1024 --image_path ./33465647.jpg

GPUs Integration

As you have already seen in the CLI explanation chapter, this code has support for GPUs (only NVIDIA at the moment). You need the CUDA driver installed; if you want to maintain consistency between the torch version installed by default with requirements.txt and a CUDA driver version, you can install the v440 NVIDIA driver + CUDA 10.2.

Data Pipeline

For a better understanding of what happens inside the script, it is useful to visualize the data pipeline, considering:

  • The Dataset: the container of all the examples (no separation is considered yet among training set and the others).
  • The Vocabulary: each example has a collection of words as a caption; it is useful to have a vocabulary containing all of them.
  • The C[aA]RNet: the neural network; no distinction is made for now between the versions with and without Attention.

Further explanations of the architecture of each single entity are in the following sections. For now it is enough to know that the script needs these 3 entities to work with data (see the Data Pipeline diagram). Imagine splitting each operation into timesteps; a minimal code sketch follows the list:

  • T_0: The dataset is loaded.
  • T_1:
    • a) The dataset is cast into a DataLoader (PyTorch class).
    • b) Given the dataset, an associated vocabulary is defined.
  • T_2: A dataloader is created.
  • T_3: C[aA]RNet will use both the dataloader and the vocabulary for training operations; in evaluation mode, instead, only the vocabulary is taken into consideration since the dataloader has size 1.
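The sketch below illustrates these timesteps with small, self-contained stand-ins for the repository's Dataset and Vocabulary entities; the class names and contents are toy assumptions, not the repo's exact API.

```python
# Minimal, self-contained sketch of the pipeline timesteps with toy stand-ins
# for the repository's Dataset / Vocabulary entities (names are assumed).
from torch.utils.data import Dataset, DataLoader

class ToyCaptionDataset(Dataset):
    """T_0: a stand-in dataset of (image_name, caption) pairs."""
    def __init__(self):
        self.samples = [("img_0.jpg", "a dog runs ."), ("img_1.jpg", "a cat sleeps .")]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

class ToyVocabulary:
    """T_1b: collect every word appearing in the captions."""
    def __init__(self, dataset):
        words = {w for _, caption in dataset.samples for w in caption.split()}
        self.word2id = {w: i for i, w in enumerate(sorted(words))}

dataset = ToyCaptionDataset()                  # T_0: load the dataset
vocabulary = ToyVocabulary(dataset)            # T_1b: build the vocabulary from the captions
loader = DataLoader(dataset, batch_size=2)     # T_1a / T_2: wrap the dataset in a DataLoader

for images, captions in loader:                # T_3: C[aA]RNet would consume these batches
    print(images, captions, len(vocabulary.word2id))
```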

Dataset Format

The way the dataset is defined follows the structure proposed by the Flickr30k Image Dataset: https://www.kaggle.com/hsankesara/flickr-image-dataset

The filesystem structure is the following:

dataset/
├─ images/
│  ├─ pippo_pluto_paperino.jpg
├─ results.csv

Images

The images folder contains the JPEG images; each of them must have a name without spaces, e.g. pippo_pluto_paperino.jpg

Results

The file that contains the collection of captions. Since a caption could contain the comma symbol (,), the separator of each column is a pipe (|). The first row of the file is the header; the associated columns are defined in the table below:

| Column | Type | Description |
|---|---|---|
| image_name | string | File name of the image associated with the caption |
| comment_number | int | Index of the caption |
| comment* | string | The caption |

*For better tokenization, the caption must have each word separated by a whitespace character. Moreover, the dot symbol (".") defines the end of the caption.
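As a quick sanity check of the format, the caption file can be loaded with pandas using the pipe separator. A minimal sketch, assuming the default path ./dataset/results.csv and the column names from the table above:

```python
# Minimal sketch: load the pipe-separated caption file and inspect it.
import pandas as pd

captions = pd.read_csv("./dataset/results.csv", sep="|")
captions.columns = [c.strip() for c in captions.columns]   # guard against stray spaces in the header

print(captions.head())
print("Number of captions:", len(captions))
print("Captions for one image:")
print(captions[captions["image_name"] == captions["image_name"].iloc[0]]["comment"])
```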

What does the script produce

Since the project was also born with the purpose of continuing its development, perhaps with further features, it is useful to describe what the net produces as output. The outputs can mainly be divided into two large groups:

  • What the net produces during the training.
  • What the net produces for an evaluation.

During training

During the training procedure the following outputs are produced:

  1. For each mini-batch of each epoch: store the loss and accuracy in a Dataframe.
  2. For each epoch, store the accuracy on validation set in a Dataframe.
  3. For each epoch, store a sample of caption generation on the last element of the last mini-batch in the validation set.
  4. Every time that the net reaches the best value in accuracy on validation data, the net is stored in non-volatile memory.

1

The DataFrame is stored as the CSV file train_results.csv at the end of each epoch, with the following structure: the first row of the file is the header; the associated columns are defined in the table below:

| Column | Type | Description |
|---|---|---|
| Epoch | int | The epoch id |
| Batch | int | The batch id |
| Loss | float | The loss evaluated for this batch |
| Accuracy | float | The accuracy evaluated for this batch |

2

The DataFrame is stored as the CSV file validation_results.csv at the end of each epoch, with the following structure: the first row of the file is the header; the associated columns are defined in the table below:

| Column | Type | Description |
|---|---|---|
| Epoch | int | The epoch id |
| Accuracy | float | The accuracy evaluated on the validation set in this epoch |
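Both report files can be inspected after (or during) training with pandas and matplotlib, which are already in the requirements. A minimal sketch, assuming the CSVs sit in the current directory:

```python
# Minimal sketch: plot the training loss and the validation accuracy from the report files.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train_results.csv")
val = pd.read_csv("validation_results.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Average the per-batch loss over each epoch.
train.groupby("Epoch")["Loss"].mean().plot(ax=ax1, title="Mean training loss per epoch")
val.set_index("Epoch")["Accuracy"].plot(ax=ax2, title="Validation accuracy per epoch")

plt.tight_layout()
plt.savefig("training_report.png")
```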

3

The features extracted from the last image of the last batch in the validation set are fed to the net in evaluation mode. A caption.png file is generated, which includes the caption generated by C[aA]RNet and the source image. If attention is enabled, a file named attention.png is also produced; for each generated word, it shows the associated attention over the source image.

4

Every time during training that a new peak of accuracy on the validation set is reached, the net is stored in non-volatile memory. The directory in which the nets are stored is hidden and is called .saved, under the root of the repository. These files are crucial for further training improvements and for evaluations after training. The naming pattern of each file is the following:

  • Encoder: NetName_encoderdim_hiddendim_attentiondim_C.pth
  • Decoder: NetName_encoderdim_hiddendim_attentiondim_R.pth

Of course these values depend on what we provided at training time.
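A minimal sketch of how such checkpoints could be written and reloaded with standard PyTorch calls is shown below; the file names follow the pattern above, but the repository's actual saving logic may differ.

```python
# Minimal sketch: save and reload the encoder/decoder weights following the
# naming pattern above (the repository's actual saving logic may differ).
import os
import torch

def checkpoint_paths(net_name, encoder_dim, hidden_dim, attention_dim, folder=".saved"):
    """Build the encoder (_C.pth) and decoder (_R.pth) file paths."""
    stem = f"{net_name}_{encoder_dim}_{hidden_dim}_{attention_dim}"
    return os.path.join(folder, f"{stem}_C.pth"), os.path.join(folder, f"{stem}_R.pth")

def save_best(encoder, decoder, net_name, encoder_dim, hidden_dim, attention_dim, folder=".saved"):
    os.makedirs(folder, exist_ok=True)
    enc_path, dec_path = checkpoint_paths(net_name, encoder_dim, hidden_dim, attention_dim, folder)
    torch.save(encoder.state_dict(), enc_path)
    torch.save(decoder.state_dict(), dec_path)

def load_best(encoder, decoder, net_name, encoder_dim, hidden_dim, attention_dim,
              folder=".saved", device="cpu"):
    enc_path, dec_path = checkpoint_paths(net_name, encoder_dim, hidden_dim, attention_dim, folder)
    encoder.load_state_dict(torch.load(enc_path, map_location=device))
    decoder.load_state_dict(torch.load(dec_path, map_location=device))
```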

During evaluation

The image is loaded, pre-processed, and fed to C[aA]RNet. A caption.png file is generated, which includes the caption generated by C[aA]RNet and the source image. If attention is enabled, a file named attention.png is also produced; for each generated word, it shows the associated attention over the source image.

Project structure

The structure of the project takes into account the possibility of expansion by the community or by further personal implementations. This diagram is only general and aims to give an idea of what you could expect to see in the code, so the entities are empty and connected following their dependencies. Each method has a related docstring, so use it as a reference. (See the UML diagram.)

Filesystem

The Filesystem structure of the project has this form:

C[aA]RNet/
├─ .saved/
├─ dataset/
│  ├─ images/
│  ├─ results.csv
├─ NeuralModels/
│  ├─ Attention/
│  │  ├─ IAttention.py
│  │  ├─ SoftAttention.py
│  ├─ Decoder/
│  │  ├─ IDecoder.py
│  │  ├─ RNetvH.py
│  │  ├─ RNetvHC.py
│  │  ├─ RNetvHCAttention.py
│  │  ├─ RNetvI.py
│  ├─ Encoder/
│  │  ├─ IEncoder.py
│  │  ├─ CResNet50.py
│  │  ├─ CResNet50Attention.py
│  ├─ CaARNet.py
│  ├─ Dataset.py
│  ├─ FactoryModels.py
│  ├─ Metrics.py
│  ├─ Vocabulary.py
├─ VARIABLE.py
├─ main.py

| File | Description |
|---|---|
| VARIABLE.py | Constant values used in the project |
| main.py | Entry point to execute the net |
| IAttention.py | The interface for implementing a new Attention model |
| SoftAttention.py | Soft Attention implementation |
| IDecoder.py | The interface for implementing a new decoder |
| RNetvH.py | Decoder implementation as LSTM, H-version |
| RNetvHC.py | Decoder implementation as LSTM, HC-version |
| RNetvHCAttention.py | Decoder implementation as LSTM, HC-version with Attention mechanism |
| IEncoder.py | The interface for implementing a new encoder |
| CResNet50.py | ResNet50 as encoder |
| CResNet50Attention.py | ResNet50 as encoder, ready for the attention mechanism |
| CaRNet.py | C[aA]RNet implementation |
| Dataset.py | Manager for a dataset |
| FactoryModels.py | Factory design pattern implementation for every neural model proposed |
| Metrics.py | Produces report files |
| Vocabulary.py | Vocabulary manager entity |

Interfaces

Interfaces are used to define a contract for everyone who wants to implement a new Encoder, Decoder or Attention model. Following the interface is mandatory; in the docstrings you can also see the suggested parameters for each method.

Encoder

The two proposed encoders are based on ResNet50 (He et al. 2015, Deep Residual Learning for Image Recognition). Depending on whether we want attention or not, they remove one or more layers from the original net.

ResNet50

(Figure: ResNet-50 neural network architecture. Image from "Privacy-Constrained Biometric System for Non-Cooperative Users".)

CResNet50

The 1st implementation removes the last layer from ResNet50, exposing the global average pooling. After the pooling, a linear layer of dimension encoder_dim is added; it receives as input what the average pooling produces, which in the case of ResNet50 is 2048 inputs.
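A minimal sketch of this idea with torchvision is shown below; it illustrates the described architecture and is not the exact CResNet50.py code (the pretrained-weights flag name may differ across torchvision versions).

```python
# Illustrative sketch of a CResNet50-style encoder: drop the final FC layer,
# keep the global average pooling, and project the 2048-d feature to encoder_dim.
import torch
import torch.nn as nn
import torchvision

class SketchCResNet50(nn.Module):
    def __init__(self, encoder_dim):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # up to avgpool
        self.projection = nn.Linear(2048, encoder_dim)

    def forward(self, images):                       # images: (B, 3, 224, 224)
        features = self.backbone(images)             # (B, 2048, 1, 1)
        features = features.flatten(start_dim=1)     # (B, 2048)
        return self.projection(features)             # (B, encoder_dim)
```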

CResNet50Attention

The 2nd implementation removes the last 2 layers from ResNet50 (average pooling + FC) and exposes the last convolutional layer, which produces a tensor of shape (Height/32, Width/32, 2048). Each portion has a 2048-dimensional vector representation. By default, with a squared RGB image of shape (3, 224, 224) as input, the total number of portions is 49.
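Again as an illustration only (not the repository's exact CResNet50Attention.py), the same backbone can be cut before the pooling so the attention module can weigh the 49 spatial portions:

```python
# Illustrative sketch of a CResNet50Attention-style encoder: keep the spatial
# feature map so the attention module can attend over the 49 image portions.
import torch
import torch.nn as nn
import torchvision

class SketchCResNet50Attention(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

    def forward(self, images):                              # images: (B, 3, 224, 224)
        features = self.backbone(images)                    # (B, 2048, 7, 7)
        features = features.permute(0, 2, 3, 1)             # (B, 7, 7, 2048)
        return features.reshape(features.size(0), -1, 2048) # (B, 49, 2048)

# Example: a random batch of 2 images yields a (2, 49, 2048) tensor of portions.
print(SketchCResNet50Attention()(torch.randn(2, 3, 224, 224)).shape)
```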

Decoder

The decoder is based on the concept of a Recurrent Neural Network, specifically in the form of an LSTM (Long Short-Term Memory), a type of RNN that refines the way the hidden state of the network is updated.

(Figure: the structure of the Long Short-Term Memory (LSTM) neural network. Reproduced from Yan [38], "Application of Long Short-Term Memory (LSTM) Neural Network for Flood Forecasting".)

Each model starts from this idea and tries a different solution for feeding the initial context retrieved by the encoder (a minimal sketch of the state initialization follows the list):

  1. RNetvI: The image context is the input of the LSTM at time t_0.
  2. RNetvH: The image context is placed in the hidden state at time t_0.
  3. RNetvHC: The image context is placed in the hidden and cell state at time t_0.
  4. RNetvHCAttention: The image context is placed in the hidden and cell state, and at each time step t a vectorial representation of the focus of attention is concatenated with the input of the LSTM.
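The generic sketch below shows the difference between these variants using a plain torch.nn.LSTMCell; the tensor names and dimensions are illustrative, not the repository's actual code.

```python
# Illustrative sketch: three ways of injecting the image context into an LSTM cell.
import torch
import torch.nn as nn

batch, embed_dim, hidden_dim = 4, 256, 512
lstm = nn.LSTMCell(input_size=embed_dim, hidden_size=hidden_dim)

image_as_input = torch.randn(batch, embed_dim)    # context projected to the embedding space
image_as_state = torch.randn(batch, hidden_dim)   # context projected to the hidden dimension
word_embedding = torch.randn(batch, embed_dim)    # embedding of the first real word

# RNetvI: the image context is the input at t_0, states start at zero.
h, c = lstm(image_as_input)

# RNetvH: the image context initializes only the hidden state.
h, c = lstm(word_embedding, (image_as_state, torch.zeros(batch, hidden_dim)))

# RNetvHC: the image context initializes both the hidden and the cell state.
h, c = lstm(word_embedding, (image_as_state, image_as_state))

# RNetvHCAttention additionally concatenates an attention context vector
# to the LSTM input at every timestep (which enlarges input_size accordingly).
```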

RNetvI

The 1st implementation uses the image context as the first input of the LSTM.

(Figure: RNetvI. Vinyals et al. 2014, Show and Tell: A Neural Image Caption Generator.)

The only constraint is that the image context needs to be projected into the word embedding space.
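For illustration, such a projection can be a single linear layer from the encoder output to the embedding dimension; the dimensions below are assumptions, not the repository's values.

```python
# Illustrative sketch: project the encoder's image context into the word embedding
# space, so it can be fed to the LSTM at t_0 as if it were the first "word".
import torch
import torch.nn as nn

encoder_dim, embedding_dim = 2048, 1024
project = nn.Linear(encoder_dim, embedding_dim)

image_context = torch.randn(8, encoder_dim)     # batch of 8 encoded images
first_input = project(image_context)            # (8, embedding_dim), fed at t_0
print(first_input.shape)
```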

RNetvH

RNetvH initializes, at time t_0, only the hidden state with the image context retrieved by the ResNet.

(Figure: RNetvH. Vinyals et al. 2014, Show and Tell: A Neural Image Caption Generator; modified version by christiandimaio.)

RNetvHC

RNetvHC initializes, at time t_0, both the hidden and the cell state with the image context retrieved by the ResNet.

(Figure: RNetvHC. Vinyals et al. 2014, Show and Tell: A Neural Image Caption Generator; modified version by christiandimaio.)

RNetvHCAttention

This implementation combines an RNetvHC with the attention mechanism.

(Figure: RNetvHCAttention. Credit to christiandimaio et al. 2022.)

Training Procedure

The training procedure involves the training set and the validation set.

  • The training set is split into mini-batches of a defined size (parameter) and shuffled.
    • For each mini-batch:
      • Provide the batch to the encoder, which will produce a context vector for each element in the batch.
      • The tensor containing the captions (already translated into vectors of ids referring to words in the vocabulary) associated with the image batch is assumed to be padded with zeros and ordered by decreasing length.
      • The context vectors and the captions are fed into the decoder.
      • The output of the decoder is the input of the method pack_padded_sequence, which removes the padded region of each caption (see the sketch after this list).
      • The loss is evaluated, and backpropagation + weight update are performed.
  • The accuracy is evaluated on the validation set.
    • If we have a new best model, the net is stored to files.
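A minimal sketch of the padding/packing step mentioned above, using PyTorch's pack_padded_sequence on dummy decoder outputs and targets (names and dimensions are illustrative):

```python
# Illustrative sketch: remove the padded positions before computing the loss.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

vocab_size = 100
lengths = [5, 3, 2]                                   # caption lengths, already in decreasing order
batch, max_len = len(lengths), max(lengths)

decoder_outputs = torch.randn(batch, max_len, vocab_size)   # logits for every timestep
targets = torch.randint(0, vocab_size, (batch, max_len))    # padded caption word ids

packed_outputs = pack_padded_sequence(decoder_outputs, lengths, batch_first=True).data
packed_targets = pack_padded_sequence(targets, lengths, batch_first=True).data

loss = nn.CrossEntropyLoss()(packed_outputs, packed_targets)
print(loss.item())
```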

Loss type

The loss used is the CrossEntropyLoss, because PyTorch internally applies a softmax over each output at time t (remember that the outputs of the LSTM have the dimension of the vocabulary, and we want the most likely word) followed by a negative log-likelihood.

The loss follows the paper (Vinyals et al. 2014) Show and Tell: A Neural Image Caption Generator:

L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)

where p_t is the probability distribution over the vocabulary produced by the softmax at timestep t, and S_t is the ground-truth word of the caption at that timestep.

Remark: Loss in the attention version

In the attention version we add a second term to the loss, the double stochastic regularization.

This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation.
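For reference, in Show, Attend and Tell (Xu et al. 2015), where this term comes from, it is written as λ Σ_i (1 − Σ_t α_{t,i})², with α_{t,i} the attention weight on image portion i at timestep t. The sketch below shows one way of adding such a term to the loss in PyTorch; it is an illustration of the standard formulation, not necessarily the repository's exact code.

```python
# Illustrative sketch: add the double stochastic regularization term to the loss.
# alphas: attention weights with shape (batch, timesteps, num_portions).
import torch

def loss_with_attention_regularization(cross_entropy_loss, alphas, regularization_lambda=1.0):
    # Encourage the attention over each portion, summed across timesteps, to be close to 1.
    attention_penalty = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
    return cross_entropy_loss + regularization_lambda * attention_penalty

# Example with dummy values: 4 captions, 10 timesteps, 49 image portions.
dummy_alphas = torch.softmax(torch.randn(4, 10, 49), dim=2)
print(loss_with_attention_regularization(torch.tensor(2.5), dummy_alphas))
```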

Personal Experiments

Here you can see the training runs that I launched as experiments; the pretrained networks can be found under the .saved folder. (See the Training Table.)

References

Authors