Identifying which celebrity is speaking using deep neural networks. Applications include machines matching voice commands to individual speakers, so that in the future a machine can anticipate personalized commands.
- Collected data from VoxCeleb [1], a database of celebrities' voices and images.
- Extracted audio features, referencing Aaqib Saeed's code:
- MFCC: Mel-frequency cepstral coefficients
- Melspectrogram: Compute a Mel-scaled power spectrogram
- Chroma-stft: Compute a chromagram from a waveform or power spectrogram. "In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as 'pitch class profiles', are a powerful tool for analyzing music whose pitches can be meaningfully categorized and whose tuning approximates to the equal-tempered scale." (Wikipedia)
- Spectral Contrast: Compute spectral contrast. Spectral contrast is defined as the level difference between peaks and valleys in the spectrum
- Tonnetz: Computes the tonal centroid features (tonnetz)
- Store features in a pandas DataFrame and prepare the data for modelling
- Run models
Figures: Mel spectrograms of Miranda Cosgrove vs. Smokey Robinson.

Figures: Tonnetz features of Miranda Cosgrove vs. Smokey Robinson.
The best model was an 18-layer CNN using the SELU activation function, yielding an F1-score of 0.73.
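The exact layer stack isn't listed here, so the following is only a much smaller Keras sketch of a SELU 1D-CNN over the per-clip feature vectors, to illustrate the conventions SELU networks use (`lecun_normal` initialization); the layer counts and sizes are assumptions, not the actual 18-layer architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features=193, n_classes=10):
    """Toy SELU 1D-CNN; the real model stacked 18 layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features, 1)),
        # SELU is conventionally paired with lecun_normal initialization
        # so activations stay approximately self-normalizing.
        layers.Conv1D(64, 5, activation="selu", kernel_initializer="lecun_normal"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="selu", kernel_initializer="lecun_normal"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
print(model.output_shape)  # (None, 10)
```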
ROC Curve for all celebrities:
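For a multi-class task like this, per-celebrity ROC curves are typically computed one-vs-rest. A minimal scikit-learn sketch, where `y_score` is a synthetic stand-in for the model's predicted class probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_classes = 3
y_true = rng.integers(0, n_classes, size=200)          # true celebrity labels
y_score = rng.random((200, n_classes))                 # stand-in for model output
y_score /= y_score.sum(axis=1, keepdims=True)          # normalize to probabilities

# One-vs-rest: binarize labels, then one ROC curve per class.
y_bin = label_binarize(y_true, classes=list(range(n_classes)))
aucs = []
for k in range(n_classes):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    aucs.append(auc(fpr, tpr))
    print(f"class {k}: AUC = {aucs[-1]:.2f}")
```

Each `(fpr, tpr)` pair can then be plotted with matplotlib to reproduce the per-celebrity ROC figure.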
I challenged my audience to guess the celebrity from an audio clip I played, then ran my model on the same clip. The model won, with three more correct guesses than the audience.
Citation:

[1] A. Nagrani, J. S. Chung, A. Zisserman. "VoxCeleb: a large-scale speaker identification dataset." INTERSPEECH, 2017.