Short Explanation
Speech Emotion Recognition (SER) is an essential task in human-computer interaction, with applications in fields such as mental health, marketing, and education. Accurately recognizing emotions from speech signals, however, remains challenging because of the high variability and complexity of human emotion. In this thesis, I propose a deep learning-based approach to SER that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, and I explore the effectiveness of several data augmentation techniques for improving the model's performance.
Dataset
To train and evaluate the proposed SER model, I used three publicly available datasets: CREMA-D, SAVEE, and RAVDESS. Each dataset consists of speech recordings of actors performing various emotional states, including neutral, happy, sad, and angry. The datasets are pre-segmented, and the audio files are provided in WAV format.
Pre-processing
To extract the relevant features from the audio files, I used the Librosa library in Python, which provides many useful functions for audio processing, such as feature extraction, spectral analysis, and time-frequency transformations. Mel Frequency Cepstral Coefficients (MFCCs) serve as the main input features for the SER model.
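The snippet below is a minimal sketch of this feature-extraction step with Librosa; the sampling rate and number of coefficients are illustrative assumptions, since the exact values are not stated here.

```python
import librosa
import numpy as np

# Illustrative parameters; the exact sampling rate and number of
# coefficients used in the thesis may differ.
SAMPLE_RATE = 22050
N_MFCC = 40

def extract_mfcc(path: str, sr: int = SAMPLE_RATE, n_mfcc: int = N_MFCC) -> np.ndarray:
    """Load a WAV file and return its MFCC matrix of shape (n_mfcc, frames)."""
    signal, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

# Example usage on one of the pre-segmented WAV files (hypothetical path):
# mfcc = extract_mfcc("RAVDESS/Actor_01/03-01-05-01-01-01-01.wav")
```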
Model Architecture
The proposed SER model consists of a CNN and an LSTM network. The CNN is used to extract high-level features from the input spectrograms, and the LSTM is used to capture the temporal dependencies in the feature sequence.
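The following Keras sketch shows one way such a CNN-LSTM could be wired together; the filter counts, layer sizes, input dimensions, and number of emotion classes are illustrative assumptions rather than the exact configuration used in the thesis.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(n_mfcc=40, max_frames=200, n_classes=6):
    # Input: an MFCC/spectrogram "image" of shape (coefficients, frames, 1).
    inputs = layers.Input(shape=(n_mfcc, max_frames, 1))

    # CNN front-end: extracts local time-frequency patterns.
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Rearrange to (time steps, features) so the LSTM reads one vector per frame.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

    # LSTM back-end: captures temporal dependencies across the frame sequence.
    x = layers.LSTM(128)(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```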
Data Augmentation
I employed three data augmentation techniques: Pitch Shift, Gaussian Noise, and Time Stretch. The figure below illustrates the differences introduced by each technique through direct subtraction, comparing the original MFCC vector (A) against the modified MFCC vectors obtained by applying Pitch Shift (B), Gaussian Noise (C), and Time Stretch (D).
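Below is a minimal sketch of the three augmentations using Librosa and NumPy; the shift amount, noise factor, and stretch rate are illustrative values, not the exact settings used in the thesis.

```python
import librosa
import numpy as np

def pitch_shift(signal, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def add_gaussian_noise(signal, noise_factor=0.005):
    """Add zero-mean Gaussian noise scaled by noise_factor."""
    noise = np.random.normal(0.0, 1.0, size=signal.shape)
    return signal + noise_factor * noise

def time_stretch(signal, rate=0.9):
    """Slow down (rate < 1) or speed up (rate > 1) the signal without changing pitch."""
    return librosa.effects.time_stretch(y=signal, rate=rate)
```

For length-preserving augmentations such as pitch shift and added noise, the MFCCs of the original and augmented signals can be subtracted directly to produce comparisons like the one in the figure; time stretching changes the number of frames, so its MFCCs must first be aligned or cropped.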
Results
The model's performance was evaluated across multiple datasets, and a comparison was made with other CNN models, including VGGNet and ResNet. The results indicate the following key findings:
- The proposed CNN-LSTM model outperforms other CNN models, such as VGGNet and ResNet, in terms of accuracy.
- The CNN-LSTM model achieves higher accuracy in both the non-augmented and augmented settings across all three datasets: 74.9% and 76.4%, respectively.
- Interestingly, the performance improvement achieved by the CNN-LSTM model is relatively modest when compared to more complex model variations.
- The CNN-LSTM model shows a modest increase of only 1.5% in accuracy, whereas ResNet50v2, ResNet50v2-LSTM, and VGGNet16 achieve more substantial increases of 4.5%, 2.6%, and 1.9% respectively.
- Upon incorporating augmentation, the final model accuracy order is as follows:
- CNN-LSTM (76.4%)
- ResNet50v2-LSTM (69%)
- VGGNet16 (68.4%)
- ResNet50v2 (67.8%)
These results highlight the CNN-LSTM model's competitive performance and its potential for further improvement with more intricate model architectures. If you would like to know more about my thesis, feel free to contact me by email.
Lessons Learned
Through this project, I gained valuable experience in working with audio data, deep learning models, and data augmentation techniques. I also learned how to visualize and interpret the internal representations of CNN layers using activation maps and feature visualization techniques. Overall, this project gave me a deeper understanding of the challenges and opportunities in the field of SER, and it also helped me graduate from my university.