ABSTRACT

Emotion recognition from audio signals is regarded as a challenging task in signal processing, as it can be viewed as a collection of static and dynamic classification tasks. Recognition of emotions from speech data has relied heavily on end-to-end feature extraction and classification using machine learning models, but the absence of feature selection and optimization has limited the performance of these methods. Mel Frequency Cepstral Coefficients (MFCC) have emerged as one of the most widely used feature extraction methods, yet their very small feature dimension restricts classification accuracy. Concatenating features extracted by different existing feature extraction methods can not only boost classification accuracy but also widen the scope for efficient feature selection. In addition to MFCC, we use Linear Predictive Coding (LPC) for feature extraction before merging the features. We propose a novel application of Manta Ray optimization to speech emotion recognition that yields state-of-the-art results in this field. We evaluate our model on SAVEE and Emo-DB, two publicly available datasets, and achieve classification accuracies of 97.49% and 97.68%, respectively.
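
As a rough illustration of the feature extraction and merging step summarized above, the sketch below extracts MFCC and LPC features from a single utterance and concatenates them. It assumes the librosa library; the number of MFCCs (13), the LPC order (12), and the frame-averaging of MFCCs are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: extract MFCC and LPC features from one utterance and
# concatenate them into a single feature vector (dimensions are assumed).
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=13, lpc_order=12):
    # Load the audio at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)

    # MFCCs: average over time frames to obtain one fixed-length vector.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_vec = mfcc.mean(axis=1)

    # LPC coefficients of the utterance (order is an assumption).
    lpc_vec = librosa.lpc(y, order=lpc_order)

    # Concatenate the two feature sets before feature selection.
    return np.concatenate([mfcc_vec, lpc_vec])
```

A feature selection stage such as the Manta Ray optimization described in the paper would then operate on this concatenated vector to pick an informative subset of features before classification.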