BibTeX

@article{44c4dfd715a948b2a2984e0ee6569862,
title = "A Neural Network Architecture for Children{\textquoteright}s Audio–Visual Emotion Recognition",
abstract = "Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio–visual ER systems. In this paper, we present a new corpus of children{\textquoteright}s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between audio and video modalities in the cross-modal fusion for children{\textquoteright}s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communications and environments where qualified professionals work with children.",
keywords = "audio–visual speech; emotion recognition; children",
author = "Матвеев, {Антон Юрьевич} and Матвеев, {Юрий Николаевич} and Фролова, {Ольга Владимировна} and Николаев, {Александр Сергеевич} and Ляксо, {Елена Евгеньевна}",
note = "This research was financially supported by the Russian Science Foundation, grant No. 22-45-02007.",
year = "2023",
month = nov,
day = "7",
doi = "10.3390/math11224573",
language = "English",
volume = "11",
journal = "Mathematics",
issn = "2227-7390",
publisher = "MDPI AG",
number = "22",
pages = "4573",
}
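
For reference, the BibTeX record above can be loaded programmatically. Below is a minimal sketch using the third-party bibtexparser package (v1 API); the parser choice and the record.bib filename are assumptions, not part of the record:

import bibtexparser
from bibtexparser.bparser import BibTexParser

# Assumes the entry above was saved verbatim to record.bib.
# common_strings=True lets the parser resolve the unquoted month
# abbreviation (month = nov) against the standard BibTeX strings.
parser = BibTexParser(common_strings=True)
with open("record.bib") as f:
    db = bibtexparser.load(f, parser=parser)

entry = db.entries[0]          # fields exposed as a dict with lowercase keys
print(entry["title"])
print(entry["doi"])            # -> 10.3390/math11224573
print(entry["year"], entry["volume"], entry["number"])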

RIS

TY - JOUR

T1 - A Neural Network Architecture for Children’s Audio–Visual Emotion Recognition

AU - Матвеев, Антон Юрьевич

AU - Матвеев, Юрий Николаевич

AU - Фролова, Ольга Владимировна

AU - Николаев, Александр Сергеевич

AU - Ляксо, Елена Евгеньевна

N1 - This research was financially supported by the Russian Science Foundation, grant No. 22-45-02007.

PY - 2023/11/7

Y1 - 2023/11/7

N2 - Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio–visual ER systems. In this paper, we present a new corpus of children’s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between audio and video modalities in the cross-modal fusion for children’s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communications and environments where qualified professionals work with children.

AB - Detecting and understanding emotions are critical for our daily activities. As emotion recognition (ER) systems develop, we start looking at more difficult cases than just acted adult audio–visual speech. In this work, we investigate the automatic classification of the audio–visual emotional speech of children, which presents several challenges, including the lack of publicly available annotated datasets and the low performance of state-of-the-art audio–visual ER systems. In this paper, we present a new corpus of children’s audio–visual emotional speech that we collected. Then, we propose a neural network solution that improves the utilization of the temporal relationships between audio and video modalities in the cross-modal fusion for children’s audio–visual emotion recognition. We select a state-of-the-art neural network architecture as a baseline and present several modifications focused on deeper learning of the cross-modal temporal relationships using attention. By conducting experiments with our proposed approach and the selected baseline model, we observe a relative improvement in performance of 2%. Finally, we conclude that focusing more on the cross-modal temporal relationships may be beneficial for building ER systems for child–machine communications and environments where qualified professionals work with children.

KW - audio–visual speech

KW - emotion recognition

KW - children

UR - https://www.mdpi.com/2227-7390/11/22/4573

UR - https://www.mendeley.com/catalogue/002e518d-21db-347e-af82-73bee37f0ce2/

U2 - 10.3390/math11224573

DO - 10.3390/math11224573

M3 - Article

VL - 11

JO - Mathematics

JF - Mathematics

SN - 2227-7390

IS - 22

M1 - 4573

ER -
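
The abstract above describes the approach only at a high level: cross-modal fusion that attends over the temporal relationships between the audio and video streams. For intuition, a generic cross-modal attention block of that flavor might look like the following PyTorch sketch; all dimensions, module names, and the number of emotion classes are illustrative assumptions, not the authors' architecture:

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Hypothetical fusion block in the spirit of the abstract,
    # not a reproduction of the paper's model.
    def __init__(self, d_model=256, n_heads=4, n_classes=5):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio, video):
        # audio: (batch, T_audio, d_model); video: (batch, T_video, d_model)
        a, _ = self.audio_to_video(audio, video, video)  # audio frames attend to video
        v, _ = self.video_to_audio(video, audio, audio)  # video frames attend to audio
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)  # pool over time
        return self.classifier(fused)

model = CrossModalFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 50, 256))  # dummy features

Because each modality queries the other, the attention weights explicitly model which audio frames align with which video frames over time, which is the kind of cross-modal temporal relationship the abstract emphasizes.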
