Speech Emotion Recognition Model Using Python and Machine Learning in 9 Steps
A Project on Speech Emotion Recognition Using Machine Learning and Python
In this article, I am going to show you how you can create a machine learning model for speech emotion recognition using Python in just 9 steps.
Speech is the most natural way we express ourselves as humans, so it is only natural to extend this communication medium to computer applications. We depend on speech so much that we recognize its importance when resorting to other communication forms like emails and text messages, where we often use emojis to express the emotions associated with a message. Because emotions play such a vital role in communication, detecting and analyzing them is of great importance in today's digital world of remote communication.
Emotion detection is a challenging task, however, because emotions are subjective: there is no common consensus on how to measure or categorize them. We define a speech emotion recognition (SER) system as a collection of methodologies that process and classify speech signals to detect the emotions embedded in them.
Such a system can find use in a wide variety of application areas, like interactive voice-based assistants or caller-agent conversation analysis. In this article, we attempt to detect the underlying emotions in recorded speech by analyzing the acoustic features of the audio recordings.
We need to extract five different features from the audio dataset and then fuse them into a single vector:
- Mel spectrogram: an acoustic time-frequency representation of a sound, with frequencies mapped to the mel scale
- MFCC: Mel-frequency cepstral coefficients, a small set of features that concisely describe the overall shape of the spectral envelope
- chroma_stft: chroma features, an interesting and powerful representation for music audio
- spectral_contrast: the decibel difference between peaks and valleys in the spectrum
- tonnetz: the tonal centroid features
These features are essential for building a speech emotion recognition model.
To extract them, we will use a Python library called librosa.
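As a quick sanity check, here is a minimal sketch (the file path is a placeholder) of what these features look like in librosa; with the settings used later in this article, their per-frame dimensions add up to the 193-dimensional vector the extraction code expects:
import numpy as np
import librosa

y, sr = librosa.load('some_clip.wav')  # placeholder path
stft = np.abs(librosa.stft(y))
print(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).shape[0])       # 40 MFCCs
print(librosa.feature.chroma_stft(S=stft, sr=sr).shape[0])        # 12 chroma bins
print(librosa.feature.melspectrogram(y=y, sr=sr).shape[0])        # 128 mel bands
print(librosa.feature.spectral_contrast(S=stft, sr=sr).shape[0])  # 7 contrast bands
print(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).shape[0])  # 6 tonal centroids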
Prerequisites:
- An audio dataset (the directory used below, Audio_Speech_Actors_01-24, corresponds to the RAVDESS speech corpus)
- Python with the dependencies installed [TensorFlow, Keras, librosa, NumPy, pandas, Matplotlib, seaborn, scikit-learn]
Steps to Follow:
1. Importing essential libraries
import os
import glob
import numpy as np
import pandas as pd
import librosa
import seaborn as sns
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
2. Now we have to extract the features and parse the audio (WAV) files from our dataset. For that, I have written the functions below:
def extract_feature(file_name):
    # load the audio and compute the short-time Fourier transform
    X, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(X))
    # average each feature over time to get one fixed-length vector per file
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T, axis=0)
    return mfccs, chroma, mel, contrast, tonnetz

def parse_audio_files(parent_dir, sub_dirs, file_ext="*.wav"):
    # 40 MFCCs + 12 chroma + 128 mel + 7 contrast + 6 tonnetz = 193 features
    features, labels = np.empty((0, 193)), np.empty(0)
    for label, sub_dir in enumerate(sub_dirs):
        for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
            try:
                mfccs, chroma, mel, contrast, tonnetz = extract_feature(fn)
            except Exception as e:
                print("Error encountered while parsing file: ", fn)
                continue
            ext_features = np.hstack([mfccs, chroma, mel, contrast, tonnetz])
            features = np.vstack([features, ext_features])
            # RAVDESS file names encode the emotion as the third dash-separated field
            labels = np.append(labels, int(os.path.basename(fn).split('-')[2]))
    return np.array(features), np.array(labels, dtype=int)
3. Now we need to one-hot encode our dataset's labels, converting the categorical emotion codes into binary vectors:
def one_hot_encode(labels):
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    one_hot_encode = np.zeros((n_labels, n_unique_labels + 1))
    one_hot_encode[np.arange(n_labels), labels] = 1
    # drop column 0 because the emotion codes start at 1, not 0
    one_hot_encode = np.delete(one_hot_encode, 0, axis=1)
    return one_hot_encode
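For example (a quick check with a toy input, not part of the original steps), three labels drawn from the codes 1-3 encode as:
print(one_hot_encode(np.array([1, 3, 2])))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]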
4. Now we save the extracted features and encoded labels into the arrays X and y:
main_dir = r'D:\Audio_Speech_Actors_01-24'
sub_dir = os.listdir(main_dir)
print("\ncollecting features and labels...")
print("\nthis will take some time...")
features, labels = parse_audio_files(main_dir, sub_dir)
print("done")
# save features and encoded labels; step 5 loads these back as X.npy / y.npy
np.save('X', features)
labels = one_hot_encode(labels)
np.save('y', labels)
5. With feature extraction done in the steps above, we load the saved data and split it into training and test sets:
X = np.load('X.npy')
y = np.load('y.npy')
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.33, random_state=20)
6. One of the most important parts of this model is its deep neural network architecture, which is as follows:
n_dim = train_x.shape[1]
n_classes = train_y.shape[1]
n_hidden_units_1 = n_dim
n_hidden_units_2 = 400
n_hidden_units_3 = 200
n_hidden_units_4 = 100

# defining the model ('relu' is assumed as the hidden-layer activation)
def create_model(activation_function='relu', optimiser='adam', dropout_rate=0.2):
    model = Sequential()
    # layer 1
    model.add(Dense(n_hidden_units_1, input_dim=n_dim, activation=activation_function))
    # layer 2
    model.add(Dense(n_hidden_units_2, activation=activation_function))
    model.add(Dropout(dropout_rate))
    # layer 3
    model.add(Dense(n_hidden_units_3, activation=activation_function))
    model.add(Dropout(dropout_rate))
    # layer 4
    model.add(Dense(n_hidden_units_4, activation=activation_function))
    model.add(Dropout(dropout_rate))
    # output layer
    model.add(Dense(n_classes, activation='softmax'))
    # model compilation
    model.compile(loss='categorical_crossentropy', optimizer=optimiser, metrics=['accuracy'])
    return model
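Before fitting, it can help to sanity-check the architecture (an optional step, not part of the original walkthrough):
model = create_model()
model.summary()  # prints each layer's output shape and parameter count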
7. Now it's time to fit our model:
model = create_model()
#train the model
history = model.fit(train_x, train_y, epochs=200, batch_size=4)
964/964 [==============================] - 2s - loss: 2.2671 - acc: 0.1494
964/964 [==============================] - 1s - loss: 1.9933 - acc: 0.2106
964/964 [==============================] - 1s - loss: 1.9295 - acc: 0.2106
964/964 [==============================] - 1s - loss: 1.8740 - acc: 0.2355
...
964/964 [==============================] - 1s - loss: 0.1319 - acc: 0.7021
964/964 [==============================] - 1s - loss: 0.4685 - acc: 0.6302
8. Next, we generate predictions from the trained model and measure its accuracy on the test set.
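A minimal sketch for this step using the standard Keras API, assuming the variables defined in the previous steps (the resulting predict array is what the EDA code in step 9 uses):
# class-probability predictions for each test sample
predict = model.predict(test_x)

# overall loss and accuracy on the held-out data
loss, accuracy = model.evaluate(test_x, test_y)
print("Test loss: %.4f, test accuracy: %.4f" % (loss, accuracy))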
9. Finally, we perform EDA on the model's predictions to get a better understanding of its behavior.
What is EDA?
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
What is Confusion Matrix?
A confusion matrix is a table that can be used to measure the performance of a machine learning algorithm, usually a supervised one. Each row of the matrix represents the instances of an actual class, and each column represents the instances of a predicted class.
emotions = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

# map predicted probabilities and true one-hot labels back to emotion names
y_pred = np.argmax(predict, 1)
predicted_emo = []
for i in range(0, test_y.shape[0]):
    predicted_emo.append(emotions[y_pred[i]])

y_true = np.argmax(test_y, 1)
actual_emo = []
for i in range(0, test_y.shape[0]):
    actual_emo.append(emotions[y_true[i]])

cm = confusion_matrix(actual_emo, predicted_emo)
index = ['angry', 'calm', 'disgust', 'fearful', 'happy', 'neutral', 'sad', 'surprised']
columns = ['angry', 'calm', 'disgust', 'fearful', 'happy', 'neutral', 'sad', 'surprised']
cm_df = pd.DataFrame(cm, index, columns)
plt.figure(figsize=(10, 6))
sns.heatmap(cm_df, annot=True)
The result is a heatmap of the confusion matrix, created with the seaborn library in Python.
In this machine learning project, I built a speech emotion recognition (SER) system. I used Keras with a TensorFlow backend to create a DNN architecture that takes voice recordings as input and extracts the essential acoustic features, making the deep neural network model more efficient and accurate.
Speech emotion recognition is a very useful concept, and it can be applied in many ways.
For example, we could use such a model to monitor the emotions and mental health of employees in an organization, which could help maintain a balance between professional and personal life and allow individuals to get help quickly, potentially preventing depression and other illnesses.
It could also help detect hate speech on the internet and other platforms, and could be beneficial in industries like security, education, and more.