Audio-Visual Emotion Recognition System for Variable Length Spatio-Temporal Samples Using Deep Transfer-Learning

Author(s):  
Antonio Cano Montes ◽  
Luis A. Hernández Gómez
Sensors ◽  
2021 ◽  
Vol 21 (22) ◽  
pp. 7665
Author(s):  
Cristina Luna-Jiménez ◽  
David Griol ◽  
Zoraida Callejas ◽  
Ricardo Kleinlein ◽  
Juan M. Montero ◽  
...  

Emotion Recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and Fine-Tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when they were used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover new ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and their combination enables improvement of system performance.


2021 ◽  
Vol 1757 (1) ◽  
pp. 012021
Author(s):  
Yuqiong Wang ◽  
Zehui Zhao ◽  
Zhiwei Huang

Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2540
Author(s):  
Zhipeng Yu ◽  
Jianghai Zhao ◽  
Yucheng Wang ◽  
Linglong He ◽  
Shaonan Wang

In recent years, surface electromyography (sEMG)-based human–computer interaction has been developed to improve the quality of life for people. Gesture recognition based on the instantaneous values of sEMG has the advantages of accurate prediction and low latency. However, the low generalization ability of the hand gesture recognition method limits its application to new subjects and new hand gestures, and brings a heavy training burden. For this reason, based on a convolutional neural network, a transfer learning (TL) strategy for instantaneous gesture recognition is proposed to improve the generalization performance of the target network. CapgMyo and NinaPro DB1 are used to evaluate the validity of our proposed strategy. Compared with the non-transfer learning (non-TL) strategy, our proposed strategy improves the average accuracy of new subject and new gesture recognition by 18.7% and 8.74%, respectively, when up to three repeated gestures are employed. The TL strategy reduces the training time by a factor of three. Experiments verify the transferability of spatial features and the validity of the proposed strategy in improving the recognition accuracy of new subjects and new gestures, and reducing the training burden. The proposed TL strategy provides an effective way of improving the generalization ability of the gesture recognition system.


Sign in / Sign up

Export Citation Format

Share Document