
Sina Alisamir

Tuesday, 3 October 2023

Abstract
Automatic emotion recognition (AER) from text, or audio recordings of natural human-human or human-machine interactions, is a technology that can have an impact in areas as diverse as education, health and entertainment. Although existing AER systems can work well in specific scenarios, they are not yet robust enough to deal with different environments, speakers and microphones (i.e. in the wild). In this thesis, several contributions have been made to advance the research on AER in the wild.
State-of-the-art AER systems use data-driven machine learning methods to recognise emotion from numerical representations of acoustic signals or text. One contribution of this thesis is to investigate the fusion of speech representations and their corresponding textual transcriptions for AER on both acted and in-the-wild data. In addition, as human transcriptions are not always available, existing Automatic Speech Recognition (ASR) systems are further explored within the same paradigm. The results show that the use of fused acoustic-textual representations can achieve better AER performance for acted and in-the-wild data than using the representation of each modality alone. The acoustic-textual representations were further fused with speaker representations, resulting in additional improvement in AER performance for acted data. The better AER performance when taking into account acoustic changes, uttered words and speaker identity is consistent with the notion that emotion is conveyed by what is said, how it is said, and who said it.
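The fusion described above can be illustrated with a minimal sketch. All names, dimensions and weights below are hypothetical stand-ins: in a real system the acoustic embedding would come from a speech encoder and the textual embedding from a text encoder run on human or ASR transcriptions, and the classifier would be trained, not random.

```python
import random

random.seed(0)

# Hypothetical dimensions for the toy example.
D_ACOUSTIC, D_TEXT, N_EMOTIONS = 8, 6, 4

def fuse(acoustic_emb, text_emb):
    """Late fusion: concatenate the two modality embeddings."""
    return acoustic_emb + text_emb  # list concatenation

def linear_head(fused, weights):
    """Toy linear classification layer over the fused representation."""
    return [sum(f * w for f, w in zip(fused, row)) for row in weights]

# Stand-ins for real encoder outputs and trained weights.
acoustic = [random.gauss(0, 1) for _ in range(D_ACOUSTIC)]
text = [random.gauss(0, 1) for _ in range(D_TEXT)]
weights = [[random.gauss(0, 1) for _ in range(D_ACOUSTIC + D_TEXT)]
           for _ in range(N_EMOTIONS)]

fused = fuse(acoustic, text)
scores = linear_head(fused, weights)
predicted = scores.index(max(scores))  # index of the predicted emotion class
```

The same pattern extends to speaker representations: a third embedding is concatenated before the classification head.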
Moreover, as emotion is a subjective concept with no universal definition, it is annotated and used in various ways across different AER systems. To address this issue, this thesis proposes a method for training a model on different datasets with different emotion annotations. The proposed method is composed of a single model, trained across multiple datasets, that computes a generic latent emotion representation, and several dataset-specific models that map this representation to each dataset's own set of emotion labels. The results suggest that the proposed method can produce emotion representations that relate the same or similar emotion labels across different datasets with different annotation schemes. Finally, by combining the proposed method with joint acoustic-textual representations, it was shown that this method can leverage acted data to improve the performance of AER in the wild.
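The shared-encoder architecture described above can be sketched as follows. The dataset names, label inventories, dimensions and weights are all hypothetical illustrations, not the thesis's actual datasets or trained parameters: the point is only the structure of one shared model feeding several dataset-specific heads.

```python
import math
import random

random.seed(1)

D_IN, D_LATENT = 10, 5

# One shared encoder, trained across all datasets, producing the
# generic latent emotion representation.
W_shared = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_LATENT)]

def shared_encoder(x):
    """Map an input representation to the generic latent emotion space."""
    return [math.tanh(sum(xi * wi for xi, wi in zip(x, row)))
            for row in W_shared]

# Dataset-specific heads: each maps the shared latent space to that
# dataset's own (hypothetical) label set.
heads = {
    "dataset_A": (["angry", "happy", "sad"],
                  [[random.gauss(0, 1) for _ in range(D_LATENT)]
                   for _ in range(3)]),
    "dataset_B": (["negative", "neutral", "positive", "excited"],
                  [[random.gauss(0, 1) for _ in range(D_LATENT)]
                   for _ in range(4)]),
}

def predict(x, dataset):
    z = shared_encoder(x)  # shared latent emotion representation
    labels, W = heads[dataset]
    scores = [sum(zi * wi for zi, wi in zip(z, row)) for row in W]
    return labels[scores.index(max(scores))]

x = [random.gauss(0, 1) for _ in range(D_IN)]
label_a = predict(x, "dataset_A")
label_b = predict(x, "dataset_B")
```

Because the latent representation is shared, annotations from one dataset (e.g. acted data) can shape a space that also benefits the heads of other datasets, which is how acted data can help AER in the wild.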

Date and Place

Tuesday, 3 October 2023 at 15:30
Seminar Room 1, IMAG Building
and videoconference

Jury Composition

President - Full Professor, Université Grenoble Alpes
Rapporteur - Full Professor, Sorbonne Université
Rapporteur - Associate Professor, University of Michigan
Examiner - Research Director, Sorbonne Université
Hussein AL OSMAN, Examiner - Associate Professor, University of Ottawa
Florian EYBEN

Submitted on September 19, 2023

Updated on September 19, 2023