Speech emotion recognition 🎤

This project aims to classify emotions from speech. It is divided into two parts:

  • Audio signal analysis through CNN/LSTM models
  • An alternative solution with a script using a speech to text library and a pre-trained transformer model

Audio signal analysis through CNN/LSTM models 🎧

Datasets and data pre-processing

For this project, we used three datasets:

  • The RAVDESS dataset. It contains 1440 audio files (16-bit, 48kHz) recorded by 24 different actors. Each actor recorded 2 different statements for each of the 7 emotions (calm, happy, sad, angry, fearful, surprise, and disgust) at 2 intensity levels, with 2 repetitions, plus neutral recordings.

  • The CREMA-D dataset. It contains 7442 audio files (16-bit, 48kHz) with 91 different actors. Each actor recorded 12 different utterances for each of the 6 emotions (anger, disgust, fear, happy, neutral, and sad).

  • The TESS dataset. It contains 2800 audio files (16-bit, 48kHz) recorded by 2 actresses. Each actress recorded 200 different utterances (target words in a carrier phrase) for each of the 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).

Data pre-processing:

We used the librosa library to extract features from the audio files. The features extracted are:

  • Mel-frequency cepstral coefficients (MFCC)
  • Chroma frequencies
  • Mel-spectrogram

The features are then split into train/validation sets and saved as .npy files.
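A minimal sketch of this extraction step (function and file names here are illustrative, not the exact ones used in data_extraction.py):

    import numpy as np
    import librosa
    from sklearn.model_selection import train_test_split

    def extract_features(path, sr=48000, n_mfcc=40):
        # Load one audio file and return a fixed-size feature vector
        y, sr = librosa.load(path, sr=sr)
        mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
        chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
        mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
        return np.concatenate([mfcc, chroma, mel])

    # audio_paths / labels are assumed to come from walking the dataset folders
    x = np.array([extract_features(p) for p in audio_paths])
    y = np.array(labels)

    x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, stratify=y)
    np.save("x_train.npy", x_train)
    np.save("x_val.npy", x_val)
    np.save("y_train.npy", y_train)
    np.save("y_val.npy", y_val)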

The models were trained on different combinations of these datasets to find the best results. We found that merging the 3 datasets gave the best results, as the larger and more varied data helped the model generalize and avoid overfitting.

The models

Several models were tried, replicating architectures we found in the literature. The best results were obtained with CNN models, followed by a hybrid CNN/LSTM model.

The models are composed of 5 to 8 convolutional layers, with Batch Normalization and max pooling inserted at some points, following the papers they are based on.
The output is then flattened and fed to a dense layer, whose output goes through a final softmax layer with one unit per emotion.
The hybrid model adds 2 LSTM layers between the convolutional layers and the dense layer.
You can check some of the models in the /models/model_plot folder. Two of them are defined and ready to be trained in the /model_training/train.py file.
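A minimal Keras sketch of this kind of architecture (layer sizes are illustrative, not the exact values used in train.py; the input is assumed to be a 1D feature vector like the one produced by the extraction step above):

    from tensorflow.keras import layers, models

    def build_cnn(input_length, n_emotions):
        # Conv blocks with Batch Normalization and max pooling, then dense + softmax
        model = models.Sequential([
            layers.Input(shape=(input_length, 1)),
            layers.Conv1D(64, 5, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling1D(2),
            layers.Conv1D(128, 5, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling1D(2),
            layers.Conv1D(128, 5, padding="same", activation="relu"),
            layers.Flatten(),
            layers.Dense(256, activation="relu"),
            layers.Dense(n_emotions, activation="softmax"),
        ])
        # Integer emotion labels are assumed, hence the sparse loss
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    def build_cnn_lstm(input_length, n_emotions):
        # Hybrid variant: same conv front-end, then 2 LSTM layers before the dense head
        model = models.Sequential([
            layers.Input(shape=(input_length, 1)),
            layers.Conv1D(64, 5, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling1D(2),
            layers.Conv1D(128, 5, padding="same", activation="relu"),
            layers.MaxPooling1D(2),
            layers.LSTM(64, return_sequences=True),
            layers.LSTM(64),
            layers.Dense(256, activation="relu"),
            layers.Dense(n_emotions, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model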

We also wrote a custom hypermodel class with keras-tuner to search for the best hyperparameters for the CNN model. You can find it in the /model_training/CNNHypermodel.py file; it is used in the /model_training/HM_train.py file.
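A rough sketch of what such a hypermodel looks like (the hyperparameter ranges are illustrative, not the ones defined in CNNHypermodel.py):

    import keras_tuner as kt
    from tensorflow.keras import layers, models

    class CNNHyperModel(kt.HyperModel):
        # Builds a 1D CNN whose depth and width are chosen by the tuner
        def __init__(self, input_length, n_emotions):
            self.input_length = input_length
            self.n_emotions = n_emotions

        def build(self, hp):
            model = models.Sequential([layers.Input(shape=(self.input_length, 1))])
            for i in range(hp.Int("conv_blocks", 3, 5)):
                model.add(layers.Conv1D(hp.Int(f"filters_{i}", 32, 256, step=32), 5,
                                        padding="same", activation="relu"))
                model.add(layers.BatchNormalization())
                model.add(layers.MaxPooling1D(2))
            model.add(layers.Flatten())
            model.add(layers.Dense(hp.Int("dense_units", 64, 512, step=64), activation="relu"))
            model.add(layers.Dense(self.n_emotions, activation="softmax"))
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            return model

    # Example search over the saved training features:
    # tuner = kt.RandomSearch(CNNHyperModel(input_length=180, n_emotions=6),
    #                         objective="val_accuracy", max_trials=20)
    # tuner.search(x_train, y_train, validation_split=0.2, epochs=30)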

Results

Confusion matrix of a CNN model trained on RAVDESS, excluding singing and calm emotions:


Accuracy of one of the CNN models on RAVDESS with 6 emotions:

How to use it

UPDATE May 2023: the library used to download YouTube videos seems to be deprecated. You can still download the audio from YouTube yourself and convert it to .wav with ffmpeg.
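For example, once you have downloaded the audio with any tool (yt-dlp is one option), you can convert it with ffmpeg (file names are placeholders):

$> ffmpeg -i downloaded_audio.m4a -ar 48000 -ac 1 audio.wav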

Play audio and print emotions by chunks of DURATION seconds:

$> python3 analyze_audio.py <audio.wav>
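Internally, this kind of per-chunk analysis comes down to slicing the signal into DURATION-second windows and classifying each one. A rough sketch (not the exact code of analyze_audio.py; the model path and per-chunk features are illustrative):

    import numpy as np
    import librosa
    from tensorflow.keras.models import load_model

    DURATION = 3                          # seconds per chunk
    model = load_model("cnn_model.h5")    # hypothetical saved model path

    y, sr = librosa.load("audio.wav", sr=48000)
    chunk_len = DURATION * sr
    for start in range(0, len(y), chunk_len):
        chunk = y[start:start + chunk_len]
        # Per-chunk features must match the ones used during training (MFCC means here)
        feats = np.mean(librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=40), axis=1)
        probs = model.predict(feats[np.newaxis, :, np.newaxis], verbose=0)
        print(f"{start // sr}s-{(start + chunk_len) // sr}s:", probs.argmax())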

Download a YouTube video and print emotions by chunks of DURATION seconds:

$> python3 ser_demo.py <youtube_url>

For a given model, print accuracy, precision, recall, f1-score for each class and the confusion matrix:

$> python3 stats.py <model>
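Roughly, this corresponds to scikit-learn's standard metrics (a sketch, not the exact contents of stats.py; file names are placeholders):

    import numpy as np
    from sklearn.metrics import classification_report, confusion_matrix
    from tensorflow.keras.models import load_model

    model = load_model("cnn_model.h5")
    x_val, y_val = np.load("x_val.npy"), np.load("y_val.npy")

    y_pred = model.predict(x_val).argmax(axis=1)
    print(classification_report(y_val, y_pred))   # accuracy, precision, recall, f1 per class
    print(confusion_matrix(y_val, y_pred))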

Data extraction tools

  • youtube_to_wav.py: download a YouTube video and convert it to .wav

  • data_extraction.py: extract features from dataset(s) and save x and y into .npy files

  • merge_datasets.py: concatenate two datasets along axis 0 and save the result into .npy files
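The merge step is essentially a concatenation of the saved feature arrays (file names below are placeholders):

    import numpy as np

    # Features/labels produced by data_extraction.py for two datasets
    x_a, y_a = np.load("x_ravdess.npy"), np.load("y_ravdess.npy")
    x_b, y_b = np.load("x_crema.npy"), np.load("y_crema.npy")

    # Stack them along axis 0 (one row per audio file) and save the merged arrays
    np.save("x_merged.npy", np.concatenate([x_a, x_b], axis=0))
    np.save("y_merged.npy", np.concatenate([y_a, y_b], axis=0))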



Speech to text (transformers model) with 🤗

Small script that uses existing libraries and a transformer model from Hugging Face.
The script will:

  1. Record audio from microphone
  2. Transcribe audio to text
  3. Translate the text to English if needed
  4. Classify with a pre-trained transformer model

Use the following command to run the prediction from the microphone:

$> python3 prediction_from_microphone.py

Then just speak when prompted. The transcribed text and the emotion prediction scores will be printed in the terminal.
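The pipeline roughly corresponds to the following sketch (the model names and recording helper are illustrative choices, not necessarily the ones used in prediction_from_microphone.py):

    import sounddevice as sd
    from transformers import pipeline

    SR = 16000
    DURATION = 5  # seconds of audio to record

    # 1. Record audio from the microphone
    audio = sd.rec(int(DURATION * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()

    # 2-3. Transcribe audio to text (Whisper models can also translate to English)
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    text = asr({"array": audio.squeeze(), "sampling_rate": SR})["text"]

    # 4. Classify the emotion of the transcribed text
    classifier = pipeline("text-classification",
                          model="j-hartmann/emotion-english-distilroberta-base",
                          top_k=None)
    print(text)
    print(classifier(text))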
