Due to a GitHub bug (issues #3035 and #3555), notebook files (files ending in ".ipynb") sometimes fail to render. Please reload the page until the content is displayed. If that does not work, shared Google Colab notebook links are provided for each directory.
- Summary
- Research Problem - Domain
- Source of Data
- Natural Language Processing (NLP)
- Models' Production
- Conclusions
- Acknowledgments
- Dissertation Paper
- Guidance through the coding files
What makes a movie suspenseful, dramatic, or sci-fi, and which words hidden in movie plots can predict this? Answering that question would help to extract significant information from narrative texts and to build an automatic system that produces emotional tags.
This project was submitted in partial fulfillment of the requirements for the degree of MSc Information Management at Strathclyde University (August 2020). It was awarded a distinction and, according to mention reports on academia.edu, it has been cited in other academic papers.
The aim was to identify, define, and automatically predict a set of emotions in movies based on their movie abstracts and metadata. This was achieved with a series of steps including data collection and cleaning, descriptive and inferential statistics, data preprocessing, and Machine & Deep Learning tools and models with a core element and focus on Natural Language Processing.
The problem was treated as a multi-label classification problem, and the result was the prediction of emotions in 55,577 unlabelled movies, as well as the identification of correlations between the predicted emotions and users' ratings and preferences.
Overall, various correlation tests indicated a strong relationship between users' watchlists and their respective emotional tags. It was concluded that emotion can constitute an important feature in the movie industry, helping recommender systems and advertising companies to generate, find, and place more highly personalized content.
Building the model presupposes the existence of emotional class labels for movies. However, the research conducted found no dataset that provides them, so an initial round of manual labeling was necessary, along with a decision on which set of emotional tags to use. After thorough research, Ekman's six discrete emotional classes (Ekman, 1992; Lazemi & Ebrahimpour-Komleh, 2016; Tkalčič et al., 2016) were chosen as the emotional tags. These classes can be easily linked with facial communication and expressions. In alphabetical order, they are: anger, disgust, fear, happiness, sadness, and surprise (Ekman, 1992; Farzindar & Inkpen, 2015:p.50).
The research data come from two main sources:
- Secondary Data: A wide variety of movie metadata provided by the MovieLens research dataset.
- Primary Data: The movie overviews for every movie in the above dataset, fetched from the TMDb website via the TMDb API for developers.
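As a rough illustration of the primary-data step, the sketch below requests one movie's details from the TMDb v3 API and pulls out its `overview` field. It assumes the TMDb id comes from the `tmdbId` column of MovieLens' links.csv and that you have your own API key; the helper names (`build_movie_url`, `extract_overview`, `fetch_overview`) are hypothetical and are not taken from the notebooks.

```python
import json
import urllib.parse
import urllib.request

# TMDb v3 movie-details endpoint; the API key is passed as a query parameter.
TMDB_BASE = "https://api.themoviedb.org/3/movie"


def build_movie_url(tmdb_id, api_key):
    """Build the movie-details URL for a single TMDb id."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{TMDB_BASE}/{int(tmdb_id)}?{query}"


def extract_overview(payload):
    """Return the plot overview from a movie-details response, or None if empty."""
    text = (payload.get("overview") or "").strip()
    return text or None


def fetch_overview(tmdb_id, api_key, timeout=10):
    """Download one movie's details and return its overview (needs network access)."""
    url = build_movie_url(tmdb_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_overview(json.load(response))
```

In practice a loop over all ids would also need error handling for missing movies and some pacing between requests; both are left out here for brevity.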
A series of NLP techniques were applied, including Part-of-Speech (POS) tagging, Sentiment Analysis, Topic Modeling, Non-Negative Matrix Factorization, and Named-Entity Recognition (NER).
The target variables for all models were the six declared emotions. Two architectures were used for the predictor variables: the first uses only the movie overviews as features, while the second combines the movie overviews with other variables (movie metadata) for potentially better performance.
After data preprocessing, feature selection, and feature engineering, the best model was chosen and evaluated.
- Construction of Machine Learning models in Scikit-Learn:
- Logistic Regression as OvR classifier
- Multinomial Naive Bayes
- Linear SVClassifier
- Random Forest
- SGD Classifier
- SVC
- K-Neighbors Classifier
- SoftMax Classifier
- Decision Tree Classifier
- XGBClassifier
- Multi-layer Perceptron (MLP) Classifier (shallow network)
- Multi-layer Perceptron (MLP) Classifier (deep network)
- Construction of Deep Neural Network models in TensorFlow: A wide variety of Convolutional and Recurrent Neural Network models, with word embeddings, pooling layers, and Long Short-Term Memory (LSTM) units, with either single or multiple output layers.
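The two feature architectures can be sketched in Scikit-Learn roughly as follows, using Logistic Regression as a one-vs-rest classifier for the six emotions. This is a simplified illustration and not the final model: the toy overviews, their labels, and the `runtime` metadata column are invented, and the TF-IDF and scaling steps are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Toy data: "runtime" is a hypothetical metadata column.
movies = pd.DataFrame({
    "overview": [
        "A detective hunts a brutal killer.",
        "A haunted house terrifies a young family.",
        "Two friends fall in love at a summer wedding.",
        "A widow grieves for her lost husband.",
        "A corrupt official poisons the town's water.",
        "A clown brings joy to children in a hospital.",
        "A twist reveals the hero was the villain all along.",
        "Zombies devour the last survivors of the plague.",
    ],
    "runtime": [110, 95, 100, 120, 105, 90, 130, 88],
})
# One row per movie, one 0/1 column per emotion (multi-label targets).
Y = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
])

# Architecture 1: overviews only.
text_only = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
text_only.fit(movies["overview"], Y)

# Architecture 2: overviews plus metadata, combined with a ColumnTransformer.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "overview"),    # a single text column (string, not list)
    ("meta", StandardScaler(), ["runtime"]),    # numeric metadata columns
])
text_and_meta = Pipeline([
    ("features", features),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
text_and_meta.fit(movies, Y)

pred = text_and_meta.predict(movies)  # shape: (n_movies, 6), entries 0 or 1
```

Any of the other listed estimators that supports binary classification could be swapped in for `LogisticRegression` inside `OneVsRestClassifier` without changing the rest of the pipeline.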
The models were evaluated with the following metrics: micro-averaged F1 score, subset accuracy score, cross-validation score, Hamming loss, and ROC-AUC score.
The final model was also evaluated via Scikit-Learn's multi-label confusion matrix report.
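For readers unfamiliar with multi-label metrics, the following sketch computes three of them, plus the per-label confusion matrices, on a tiny hand-made example with three labels. The arrays are invented; only the Scikit-Learn functions themselves are relevant.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    multilabel_confusion_matrix,
)

# Rows are movies, columns are labels (toy data).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Micro-averaged F1 pools true/false positives and negatives over all labels.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# On multi-label input, accuracy_score is subset accuracy:
# a row counts as correct only if every label matches.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss is the fraction of individual label slots that are wrong.
h_loss = hamming_loss(y_true, y_pred)

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)

print(micro_f1, subset_acc, h_loss)
print(per_label_cm)
```

Subset accuracy is the strictest of these, which is why it is usually reported alongside Hamming loss and micro-averaged F1.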
- Manual Labelling of 300 movies in terms of their represented emotion (the construction of an initial dataset with reference to their emotion was necessary for the models' production).
- Automatic extraction and prediction of emotional tags in 55,577 movies, as well as the depiction of the combinations of their emotions (e.g. some movies can generate more than one emotion).
- Intensity magnitude in the emotion production procedure: different movies may encompass different emotions and a different number or combination of emotions, but they can also differ in intensity. The researcher therefore assigned each of the six predicted emotions one of 3 intensity levels: low, moderate, or high.
- Emotional tags can constitute an additional feature in the movie industry for Recommender Systems (RSs) and advertising companies to integrate. The project can also be useful in the movie industry in the context of film psychology, as well as in the Emotion Aware Recommender Systems (EARSs) helping RSs to scale by recommending new items based on affective item features and users’ emotional reactions.
- Enhancement of RSs by refining the retrieval of similar movies/TV shows.
- Movies with low popularity may have few or no tags. Therefore, an automatic prediction of emotional tags can alleviate the problem of incompleteness in tag spaces, the cold start problem, and data sparsity in RSs.
All references to authors and sources can be found inside the dissertation paper, which is also uploaded in this repository. Additional coding references can be found inside the notebooks.
- Data exploration of all six MovieLens CSV files provided.
- Merging "genome tags" and "genome scores" and introducing ten strata of relevance.
- Introducing a modified genome_scores CSV file, keeping only the highest relevance scores and their metadata (relevance score>=0.7).
- Adding a new column "rating_cat" in ratings.csv (values [1,3]).
- Merging movies.csv & links.csv. Creating a new column "overview" in order later to extract data from TMDb.
- Fetching movie plots from TMDb via the TMDb API for developers.
- Several feature engineering steps.
- NLP techniques applied in movies_final dataframe.
- Tokenization & POS/ Visualization.
- Sentiment Analysis / Compound polarity scores.
- Non-Negative Matrix Factorization & Topic Modelling.
- Named-Entity Recognition (NER) & Visualization / introducing the column "entities".
- Construction of "movies_final_2.csv".
- Fixing entities.
- Choosing genres for the final choice of movie sample for the manual emotion labelling.
- Emotion labelling in 300 movies.
- Construction of "movies_final_3.csv".
- Application of various Machine Learning models with NLP, using as features the movie overviews.
- Application of various Machine Learning models with NLP, using as features the movie overviews plus various movie metadata via Column Transformer with a pipeline.
- Here is where the final model architecture is located, although it got even more fine-tuned in "4e_Final_Model_&_Predictions.ipynb".
- Application of various Deep Learning models with NLP.
- Word Embeddings.
- Single and multiple output layers.
- Making Predictions with BERT via the FastBert Deep Learning library.
- Fine-tuning the final model initially created in "4b_Models_ML_Overviews & Metadata.ipynb".
- Making predictions in 55,577 unlabelled movies in terms of emotions.
- Extracting "predictions_decision_scores_df.csv".
- Normality checks to test whether the variables used in the hypothesis tests follow a normal distribution.
- Two sets of hypothesis tests:
- ratings vs emotions (across the whole dataset, non user-centric approach): 4 tests.
- user-centric approach: a) ratings vs emotions (per user): 4 tests. b) emotion scores vs emotion scores in sets of movies in users' watchlists, per unique user: 4 tests.
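A normality check followed by a correlation test can be sketched with SciPy as shown below. The README does not state which tests were used, so the Shapiro-Wilk test and Spearman's rank correlation are assumptions chosen for illustration, and the ratings and emotion scores are synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one emotion's decision scores and the matching ratings.
rng = np.random.default_rng(0)
emotion_score = rng.normal(size=200)
rating = 3.0 + 0.5 * emotion_score + rng.normal(scale=0.5, size=200)

# Normality check (assumed here: Shapiro-Wilk). A small p-value suggests the
# variable is not normally distributed, which favours a rank-based test.
shapiro_stat, shapiro_p = stats.shapiro(rating)
looks_normal = shapiro_p > 0.05

# Rank correlation between emotion scores and ratings (assumed here: Spearman).
rho, rho_p = stats.spearmanr(emotion_score, rating)

print(f"Shapiro-Wilk p={shapiro_p:.3f}, looks normal: {looks_normal}")
print(f"Spearman rho={rho:.3f}, p={rho_p:.3g}")
```

For the user-centric approach, the same two calls would be repeated on each user's subset of movies.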
The dissertation paper was submitted at Strathclyde University for the degree of MSc in Information Management at the Computer & Information Sciences department (August 2020). It contains the theoretical and practical parts of this repository project.
Labelling_300_Movies.xlsx: The Excel file containing the manual labelling of 300 movies. Note that the labels are not in binary 0/1 form, because the researcher originally intended to use scores in the range [0,3], with values above 1 implying greater intensity. That scheme was not used in the end. Instead, 1) values >=1 were treated as "1" and values of 0 as "0", and 2) the intensity magnitude of emotions was derived from the confidence scores of the model's decision function (where a higher confidence score indicates a higher emotional intensity level).
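Both steps can be sketched in a few lines of NumPy. The binarization rule (>=1 becomes 1) comes from the description above, whereas the cut points that turn decision scores into low, moderate, and high are hypothetical, since the actual thresholds are not stated here. The function name `intensity_levels` is also made up for this example.

```python
import numpy as np

# Manual labels with scores in [0, 3] (toy values): rows are movies, columns emotions.
raw_labels = np.array([[0, 2, 1], [3, 0, 0]])

# Values >= 1 become 1, values of 0 stay 0.
binary_labels = (raw_labels >= 1).astype(int)


def intensity_levels(scores, low_cut, high_cut):
    """Bin decision scores: below low_cut is low, below high_cut is moderate, else high."""
    names = np.array(["low", "moderate", "high"])
    return names[np.digitize(scores, [low_cut, high_cut])]


# Hypothetical decision-function scores and hypothetical cut points.
scores = np.array([0.1, 0.8, 2.3])
levels = intensity_levels(scores, low_cut=0.5, high_cut=1.5)

print(binary_labels)
print(levels)
```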