- Audio signal analysis through CNN/LSTM models
- An alternative solution: a script using a speech-to-text library and a pre-trained transformer model
- The RAVDESS dataset. It contains 1440 audio files (16-bit, 48kHz) from 24 different actors. Each actor recorded 2 different statements for each of the 8 emotions (neutral, calm, happy, sad, angry, fearful, surprise, and disgust).
- The CREMA-D dataset. It contains 7442 audio files (16-bit, 48kHz) from 91 different actors. Each actor recorded 12 different utterances for each of the 6 emotions (anger, disgust, fear, happy, neutral, and sad).
- The TESS dataset. It contains 2800 audio files (16-bit, 48kHz) from 2 different actresses. Each actress recorded 200 target words for each of the 7 emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).
We used the librosa library to extract features from the audio files. The features extracted are:
- Mel-frequency cepstral coefficients (MFCC)
- Chroma frequencies
- Mel-spectrogram
The features are then split into train/validation sets and saved as .npy files.
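A minimal sketch of this extraction step (the 40 MFCCs, the time-averaging and the file names are assumptions for illustration; data_extraction.py may use different parameters):

```python
import librosa
import numpy as np
from sklearn.model_selection import train_test_split

def extract_features(path, n_mfcc=40):
    """Return one feature vector: time-averaged MFCC, chroma and mel-spectrogram."""
    signal, sr = librosa.load(path, sr=48000)
    mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel])

# `files` and `labels` come from walking the dataset folders (not shown)
# x = np.array([extract_features(f) for f in files])
# y = np.array(labels)
# x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, stratify=y)
# np.save("x_train.npy", x_train); np.save("y_train.npy", y_train)
# np.save("x_val.npy", x_val); np.save("y_val.npy", y_val)
```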
The models were trained on different combinations of these datasets. We found that merging the three datasets gave the best results, as it helped the model generalize and avoid overfitting.
Several models were tried, replicating architectures found in the literature. The best results were obtained with CNN models, followed by a hybrid CNN/LSTM model.
The models are composed of 5 to 8 convolutional layers, with batch normalization and max pooling inserted at various points, depending on the paper each model is based on. The output is then flattened and fed to a dense layer, whose output goes through a final softmax layer with one unit per emotion.
The hybrid model adds 2 LSTM layers between the convolutional layers and the dense layer.
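For illustration, a minimal Keras sketch of this general layout (layer counts, filter sizes and the input shape are placeholders, not the exact configurations of the trained models):

```python
from tensorflow.keras import layers, models

NUM_EMOTIONS = 6        # placeholder: depends on the dataset combination
INPUT_SHAPE = (180, 1)  # placeholder: length of the extracted feature vector

def build_cnn():
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(256, 5, padding="same", activation="relu"),
        # The hybrid variant inserts two LSTM layers here, e.g.
        # layers.LSTM(64, return_sequences=True) and layers.LSTM(64),
        # in which case the Flatten layer below is no longer needed.
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```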
You can check some of the models in the /models/model_plot folder. Two of them are saved and ready to be trained in the /model_training/train.py file.
We also made a custom class with keras-tuner to find the best hyperparameters for the CNN model. You can find it in the /model_training/CNNHypermodel.py file; it is used in the /model_training/HM_train.py file.
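A minimal sketch of the keras-tuner mechanism; the class below is hypothetical, and its search space, input shape and class count are illustrative rather than those of the actual CNNHypermodel class:

```python
import keras_tuner as kt
from tensorflow.keras import layers, models

class EmotionCNNHyperModel(kt.HyperModel):  # illustrative, not the real CNNHypermodel
    def build(self, hp):
        model = models.Sequential([layers.Input(shape=(180, 1))])  # assumed input shape
        for i in range(hp.Int("conv_layers", 5, 8)):
            model.add(layers.Conv1D(hp.Choice(f"filters_{i}", [64, 128, 256]),
                                    kernel_size=5, padding="same", activation="relu"))
            if i % 2 == 0:
                model.add(layers.BatchNormalization())
                model.add(layers.MaxPooling1D(2))
        model.add(layers.Flatten())
        model.add(layers.Dense(hp.Int("dense_units", 64, 256, step=64), activation="relu"))
        model.add(layers.Dense(6, activation="softmax"))  # 6 emotion classes assumed
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

# tuner = kt.Hyperband(EmotionCNNHyperModel(), objective="val_accuracy", max_epochs=30)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))
```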
Confusion matrix of a CNN model trained on RAVDESS, excluding the singing and calm samples:
Accuracy of one of the CNN models on RAVDESS with 6 emotions:
UPDATE May 2023: The library used to download YouTube videos seems to be deprecated. You can still download the audio from YouTube yourself and convert it to WAV with ffmpeg.
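For example, assuming the downloaded audio file is named audio.m4a:
$> ffmpeg -i audio.m4a -ar 48000 -ac 1 audio.wav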
Play an audio file and print the predicted emotions in chunks of DURATION seconds:
$> python3 analyze_audio.py <audio.wav>
Download a YouTube video and print the predicted emotions in chunks of DURATION seconds:
$> python3 ser_demo.py <youtube_url>
For a given model, print the accuracy, precision, recall, and F1-score for each class, as well as the confusion matrix:
$> python3 stats.py <model>
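A minimal sketch of how such per-class metrics can be computed with scikit-learn (the model and .npy file names are placeholders; stats.py may load its data differently):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.models import load_model

model = load_model("model.h5")    # placeholder model path
x_val = np.load("x_val.npy")      # placeholder validation features
y_val = np.load("y_val.npy")      # placeholder integer labels (argmax first if one-hot)

y_pred = np.argmax(model.predict(x_val), axis=1)
print(classification_report(y_val, y_pred))  # precision, recall, f1-score per class + accuracy
print(confusion_matrix(y_val, y_pred))
```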
- youtube_to_wav.py: download a YouTube video and convert it to WAV
- data_extraction.py: extract features from the dataset(s) and save x and y into .npy files
- merge_datasets.py: concatenate two datasets along axis 0 and save them into .npy files (see the sketch below)
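A minimal sketch of the merging step (file names are placeholders):

```python
import numpy as np

# Stack the feature and label arrays of two datasets along axis 0
x = np.concatenate([np.load("x_ravdess.npy"), np.load("x_cremad.npy")], axis=0)
y = np.concatenate([np.load("y_ravdess.npy"), np.load("y_cremad.npy")], axis=0)
np.save("x_merged.npy", x)
np.save("y_merged.npy", y)
```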
A small script that uses existing libraries and a pre-trained transformer model from Hugging Face.
The script will:
- Record audio from the microphone
- Transcribe the audio to text
- Translate it to English if needed
- Classify the text with a pre-trained transformer model
Use the following command to run the prediction from microphone:
$> python3 prediction_from_microphone.py
Then just speak when prompted: the transcribed text and the emotion prediction scores will be printed in the terminal.
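A minimal sketch of the record / transcribe / classify flow, assuming the SpeechRecognition library (which needs PyAudio for microphone input) and the j-hartmann/emotion-english-distilroberta-base model; both are assumptions, and the translate-to-English step is omitted here:

```python
import speech_recognition as sr
from transformers import pipeline

recognizer = sr.Recognizer()
emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed model choice
)

# Record from the default microphone until a pause is detected
with sr.Microphone() as source:
    print("Speak when ready...")
    audio = recognizer.listen(source)

# Transcribe to text (English assumed; the translation step is skipped here)
text = recognizer.recognize_google(audio, language="en-US")
print("Transcription:", text)

# Classify the transcription; top_k=None returns a score for every emotion label
results = emotion_classifier(text, top_k=None)
scores = results[0] if isinstance(results[0], list) else results
for item in sorted(scores, key=lambda d: d["score"], reverse=True):
    print(f"{item['label']}: {item['score']:.2%}")
```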