ABSTRACT

Over the last decade, emotion recognition has attracted considerable attention in the field of human-computer interaction. Current recognition accuracy leaves room for improvement, and the fundamental temporal relations within speech waveforms remain underexplored. A speech emotion recognition method is proposed that exploits differences in emotional saturation between time frames, combining frame-level speech features with attention-based Long Short-Term Memory (LSTM) recurrent neural networks (RNNs). In place of standard statistical features, frame-level speech features are derived directly from the waveform so that the temporal relations of the original speech are preserved through the sequence of frames. Two LSTM enhancement algorithms based on the attention mechanism are presented to distinguish the emotional saturation of individual frames. In addition, an Emotion Recognition in Conversation system capable of recognizing facial emotion in real time is proposed.
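
To make the architecture concrete, the following is a minimal sketch (not the authors' implementation) of an attention-based LSTM that weights frame-level features by their emotional saturation before classification. It assumes PyTorch and uses hypothetical dimensions: 40-dimensional frame features, a 128-unit LSTM, and 4 emotion classes.

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Attention-weighted LSTM over frame-level speech features (sketch)."""
    def __init__(self, feat_dim=40, hidden_dim=128, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Scores each frame's hidden state so frames with higher
        # emotional saturation receive larger attention weights.
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        outputs, _ = self.lstm(frames)              # (B, T, hidden_dim)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (B, T, 1)
        context = (weights * outputs).sum(dim=1)    # weighted sum over time
        return self.classifier(context)             # (B, num_classes)

# Usage: a batch of 8 utterances, each with 200 frames of 40-dim features.
model = AttentionLSTM()
logits = model(torch.randn(8, 200, 40))
print(logits.shape)  # torch.Size([8, 4])
```

The attention layer here is a simple per-frame scoring function followed by a softmax over time; the paper's two LSTM enhancement algorithms may differ in how these weights are computed or applied.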