- Audio signal analysis through CNN/LSTM models
- An alternative solution: a script using a speech-to-text library and a pre-trained transformer model
- The RAVDESS dataset. It contains 1440 audio files (16-bit, 48kHz) from 24 different actors. Each actor recorded 2 different statements for each of the 8 emotions (neutral, calm, happy, sad, angry, fearful, surprise, and disgust).
- The CREMA-D dataset. It contains 7442 audio files (16-bit, 48kHz) from 91 different actors. Each actor recorded 12 different utterances for each of the 6 emotions (anger, disgust, fear, happy, neutral, and sad).
- The TESS dataset. It contains 2800 audio files (16-bit, 48kHz) from 2 different actresses. Each actress recorded 200 target words for each of the 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).
We used the librosa library to extract features from the audio files. The features extracted are:
- Mel-frequency cepstral coefficients (MFCC)
- Chroma frequencies
- Mel-spectrogram
The features are then split into train/validation sets and saved as .npy files.
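A minimal sketch of this extraction step (the 40 MFCCs, the time-averaging and the file names are assumptions for illustration; data_extraction.py may use different parameters):

```python
import librosa
import numpy as np
from sklearn.model_selection import train_test_split

def extract_features(path, n_mfcc=40):
    """Return one feature vector: time-averaged MFCC, chroma and mel-spectrogram."""
    signal, sr = librosa.load(path, sr=48000)
    mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel])

# `files` and `labels` come from walking the dataset folders (not shown)
# x = np.array([extract_features(f) for f in files])
# y = np.array(labels)
# x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, stratify=y)
# np.save("x_train.npy", x_train); np.save("y_train.npy", y_train)
# np.save("x_val.npy", x_val); np.save("y_val.npy", y_val)
```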
The models were trained on different combinations of these datasets. We found that merging the three datasets gave the best results, as it helped the model generalize and avoid overfitting.
Several models were tried, replicating architectures found in the literature. The best results were obtained with CNN models, followed by a hybrid CNN/LSTM model.
The models are composed of 5 to 8 convolutional layers, with batch normalization and max pooling inserted at various points, depending on the paper each model is based on. The output is then flattened and fed to a dense layer, whose output goes through a final softmax layer with one unit per emotion.
The hybrid model adds 2 LSTM layers between the convolutional layers and the dense layer.
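For illustration, a minimal Keras sketch of this general layout (layer counts, filter sizes and the input shape are placeholders, not the exact configurations of the trained models):

```python
from tensorflow.keras import layers, models

NUM_EMOTIONS = 6        # placeholder: depends on the dataset combination
INPUT_SHAPE = (180, 1)  # placeholder: length of the extracted feature vector

def build_cnn():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(256, 5, padding="same", activation="relu"),
        # The hybrid variant inserts two LSTM layers here, e.g.
        # layers.LSTM(64, return_sequences=True) and layers.LSTM(64),
        # in which case the Flatten layer below is no longer needed.
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```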
You can check some of the models in the /models/model_plot folder. Two of them are saved and ready to be trained in the /model_training/train.py file.
We also made a custom class with keras-tuner to find the best hyperparameters for the CNN model. You can find it in the /model_training/CNNHypermodel.py file; it is used in the /model_training/HM_train.py file.
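A minimal sketch of the keras-tuner mechanism; the class below is hypothetical, and its search space, input shape and class count are illustrative rather than those of the actual CNNHypermodel class:

```python
import keras_tuner as kt
from tensorflow.keras import layers, models

class EmotionCNNHyperModel(kt.HyperModel):  # illustrative, not the real CNNHypermodel
    def build(self, hp):
        model = models.Sequential([layers.Input(shape=(180, 1))])  # assumed input shape
        for i in range(hp.Int("conv_layers", 5, 8)):
            model.add(layers.Conv1D(hp.Choice(f"filters_{i}", [64, 128, 256]),
                                    kernel_size=5, padding="same", activation="relu"))
            if i % 2 == 0:
                model.add(layers.BatchNormalization())
                model.add(layers.MaxPooling1D(2))
        model.add(layers.Flatten())
        model.add(layers.Dense(hp.Int("dense_units", 64, 256, step=64), activation="relu"))
        model.add(layers.Dense(6, activation="softmax"))  # 6 emotion classes assumed
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

# tuner = kt.Hyperband(EmotionCNNHyperModel(), objective="val_accuracy", max_epochs=30)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))
```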
Confusion matrix of a CNN model trained on RAVDESS, excluding the singing and calm samples:
Accuracy of one of the CNN models on RAVDESS with 6 emotions:
UPDATE May 2023: The library used to download YouTube videos seems to be deprecated. You can still download the audio from YouTube yourself and convert it to WAV with ffmpeg.
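For example, assuming the downloaded audio file is named audio.m4a:
$> ffmpeg -i audio.m4a -ar 48000 -ac 1 audio.wav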
Play an audio file and print the predicted emotions in chunks of DURATION seconds:
$> python3 analyze_audio.py <audio.wav>
Download a YouTube video and print the predicted emotions in chunks of DURATION seconds:
$> python3 ser_demo.py <youtube_url>
For a given model, print the accuracy, precision, recall, and F1-score for each class, as well as the confusion matrix:
$> python3 stats.py <model>
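A minimal sketch of how such per-class metrics can be computed with scikit-learn (the model and .npy file names are placeholders; stats.py may load its data differently):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.models import load_model

model = load_model("model.h5")    # placeholder model path
x_val = np.load("x_val.npy")      # placeholder validation features
y_val = np.load("y_val.npy")      # placeholder integer labels (argmax first if one-hot)

y_pred = np.argmax(model.predict(x_val), axis=1)
print(classification_report(y_val, y_pred))  # precision, recall, f1-score per class + accuracy
print(confusion_matrix(y_val, y_pred))
```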
- youtube_to_wav.py: download a YouTube video and convert it to WAV
- data_extraction.py: extract features from the dataset(s) and save x and y into .npy files
- merge_datasets.py: concatenate two datasets along axis 0 and save them into .npy files (see the sketch below)
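A minimal sketch of the merging step (file names are placeholders):

```python
import numpy as np

# Stack the feature and label arrays of two datasets along axis 0
x = np.concatenate([np.load("x_ravdess.npy"), np.load("x_cremad.npy")], axis=0)
y = np.concatenate([np.load("y_ravdess.npy"), np.load("y_cremad.npy")], axis=0)
np.save("x_merged.npy", x)
np.save("y_merged.npy", y)
```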
A small script that uses existing libraries and a pre-trained transformer model from Hugging Face.
The script will:
- Record audio from the microphone
- Transcribe the audio to text
- Translate it to English if needed
- Classify the text with a pre-trained transformer model
Use the following command to run the prediction from microphone:
$> python3 prediction_from_microphone.py
Then just speak when prompted: the transcribed text and the emotion prediction scores will be printed in the terminal.
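A minimal sketch of the record / transcribe / classify flow, assuming the SpeechRecognition library (which needs PyAudio for microphone input) and the j-hartmann/emotion-english-distilroberta-base model; both are assumptions, and the translate-to-English step is omitted here:

```python
import speech_recognition as sr
from transformers import pipeline

recognizer = sr.Recognizer()
emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed model choice
)

# Record from the default microphone until a pause is detected
with sr.Microphone() as source:
    print("Speak when ready...")
    audio = recognizer.listen(source)

# Transcribe to text (English assumed; the translation step is skipped here)
text = recognizer.recognize_google(audio, language="en-US")
print("Transcription:", text)

# Classify the transcription; top_k=None returns a score for every emotion label
results = emotion_classifier(text, top_k=None)
scores = results[0] if isinstance(results[0], list) else results
for item in sorted(scores, key=lambda d: d["score"], reverse=True):
    print(f"{item['label']}: {item['score']:.2%}")
```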