This repository provides an Encoder-Decoder Sequence-to-Sequence model that generates captions for input videos. A pre-trained VGG16 model is used to extract features from every frame of the video.
Video captioning is important because it supports numerous applications. For example, it can make searching for videos across the web more efficient, and it can be used to cluster videos whose generated captions are highly similar.
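As a rough illustration of the encoder-decoder architecture described above, a minimal Keras sketch is shown below. The layer sizes, vocabulary size, caption length, and frame count are assumptions for illustration only, not the exact values used in this repository.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed dimensions for illustration only: 80 sampled frames per video,
# 4096-d VGG16 features per frame, a 1500-word vocabulary, captions up to 10 tokens.
NUM_FRAMES, FEAT_DIM, VOCAB_SIZE, MAX_CAP_LEN, LATENT_DIM = 80, 4096, 1500, 10, 512

# Encoder: consumes the sequence of per-frame VGG16 features.
encoder_inputs = layers.Input(shape=(NUM_FRAMES, FEAT_DIM), name="video_features")
_, state_h, state_c = layers.LSTM(LATENT_DIM, return_state=True)(encoder_inputs)

# Decoder: generates the caption one token at a time, initialised with the encoder state.
decoder_inputs = layers.Input(shape=(MAX_CAP_LEN, VOCAB_SIZE), name="caption_tokens")
decoder_outputs = layers.LSTM(LATENT_DIM, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
decoder_outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```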
- TensorFlow
- Keras
- OpenCV
- NumPy
- functools (Python standard library)
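A quick sanity check for these dependencies is sketched below; the package names assumed for installation are `tensorflow`, `keras`, `opencv-python`, and `numpy`, while `functools` ships with Python and needs no install.

```python
# Verify that the required libraries import correctly and print their versions.
import functools
import cv2        # provided by the opencv-python package
import numpy as np
import keras
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Keras     :", keras.__version__)
print("OpenCV    :", cv2.__version__)
print("NumPy     :", np.__version__)
```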
- The MSVD dataset developed by Microsoft can be downloaded from here.
- The dataset contains 1,450 short, manually captioned YouTube clips for training and 100 clips for testing.
- Each video is assigned a unique ID, and each ID has roughly 15–20 captions.
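The sketch below illustrates how captions might be grouped by video ID. The annotation file layout is an assumption (a CSV with at least `VideoID` and `Description` columns, and a hypothetical file name); adjust the field names to match the actual MSVD corpus file.

```python
import csv
from collections import defaultdict

def load_captions(csv_path):
    """Group reference captions by video ID.

    Assumes a CSV with (at least) 'VideoID' and 'Description' columns;
    change the field names to match the real MSVD annotation file.
    """
    captions = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("Description"):
                captions[row["VideoID"]].append(row["Description"].strip())
    return captions

# Example usage (file name is an assumption):
# caps = load_captions("video_corpus.csv")
# some_id = next(iter(caps))
# print(some_id, "->", len(caps[some_id]), "captions")
```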
- To extract features from the frames of every input video with the pre-trained VGG16 model, run `Extract_Features_Using_VGG.py` (a sketch of this step follows the list).
- To train the developed model, run `training_model.py`.
- To use the trained video captioning model for inference, run `predict_model.py`.
- To use the trained model for real-time video caption generation, run `Video_Captioning.py`.
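The feature-extraction step might look roughly like the sketch below: VGG16 is truncated at its `fc2` layer so that each frame yields a 4096-d feature vector. The layer choice and the 80-frame sampling budget are assumptions for illustration, not necessarily what `Extract_Features_Using_VGG.py` does.

```python
import cv2
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model

# Truncate VGG16 at the fc2 layer to obtain a 4096-d feature per frame.
base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_video_features(video_path, num_frames=80):
    """Sample frames uniformly from a video and return an array of VGG16 fc2 features."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"Could not read any frames from {video_path}")

    # Pick num_frames frames at uniform intervals across the clip.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    batch = []
    for i in idx:
        img = cv2.resize(frames[i], (224, 224))                  # VGG16 input size
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype("float32")
        batch.append(img)
    batch = preprocess_input(np.stack(batch))
    return feature_extractor.predict(batch, verbose=0)           # shape: (num_frames, 4096)
```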
Below are a few results of the developed video captioning approach on test videos: