Article, 2024

CNN-Transformer based emotion classification from facial expressions and body gestures

Multimedia Tools and Applications, ISSN 1380-7501, Volume 83, Issue 8, Pages 23129-23171, DOI 10.1007/s11042-023-16342-5

Contributors

Karatay B. Bestepe D. 0000-0003-1933-410X [1] Sailunaz K. 0000-0001-8751-4108 [2] Ozyer T. 0000-0002-2529-5533 [3] Alhajj R. 0000-0001-6657-9738 (Corresponding author) [1] [2] [4]

Affiliations

  1. [1] University of Calgary [NORA names: Canada; America, North; OECD];
  2. [2] Istanbul Medipol University [NORA names: Turkey; Asia, Middle East; OECD];
  3. [3] Ankara Medipol University [NORA names: Turkey; Asia, Middle East; OECD];
  4. [4] University of Southern Denmark [NORA names: SDU University of Southern Denmark; University; Denmark; Europe, EU; Nordic; OECD]

Abstract

Classifying emotions correctly from different data sources such as text, images, videos, and speech has been an active research area across various disciplines. Automatic emotion detection from videos and images is one of the most challenging of these tasks and has been studied using both supervised and unsupervised machine learning methods. Deep learning has also been employed, with models trained on facial and body features obtained from pose and landmark detectors and trackers. In this paper, facial and body features extracted by the OpenPose tool are used to detect 6, 7, and 9 basic emotions from videos and images with a novel deep neural network framework that combines the Gaussian mixture model with CNN, LSTM, and Transformer components, yielding a CNN-LSTM model and a CNN-Transformer model, each with and without Gaussian centers. Experiments on two benchmark datasets, FABO and CK+, showed that the proposed Transformer model with 9 and 12 Gaussian centers, combined with the video generation approach, achieved close to 100% classification accuracy on the FABO dataset, outperforming the other DNN frameworks for emotion detection. The framework reported over 90% accuracy for most feature combinations on both datasets, making it competitive for video emotion classification.

Keywords

Body gesture, CNN, Emotion classification, Emotion detection, Gaussian mixture model, LSTM, Transformer

Funders

  • Türkiye Bilimsel ve Teknolojik Araştırma Kurumu

Data Provider: Elsevier