
Anderson Augusma

Friday, January 16, 2026

Multimodal Group Emotion Recognition In-the-Wild: Towards a Privacy-Safe Non-Individual Approach

This thesis addresses the challenge of group emotion recognition (GER) in-the-wild. Traditional approaches to emotion recognition often rely on individual-level cues such as facial analysis, gaze tracking, or voice profiling. While effective in some settings, these methods raise serious privacy and surveillance concerns. To overcome these limitations, this thesis prioritizes privacy preservation by leveraging only collective audio–visual signals, focusing on group-level rather than individual-level emotion recognition. The overall objective is to develop multimodal models that can infer group emotions while avoiding the risks associated with individual monitoring and surveillance.

Two complementary frameworks are proposed to achieve this goal. The first introduces a cross-attention multimodal architecture for audio–video fusion, combined with a Frames Attention Pooling (FAP) strategy. This framework is further supported by synthetic data augmentation and validated through extensive ablation studies, which demonstrate its effectiveness and robustness for GER in real-world conditions. The second, the Variational Encoder Multi-Decoder (VE-MD), introduces a shared latent space jointly optimized for emotion classification and for the prediction of body and face structural representations. Two structural representation decoding strategies are explored, DETR-based and heatmap-based, highlighting their respective strengths and limitations in group versus individual settings. A detailed analysis reveals how integrating structural representations impacts GER differently than non-GER tasks.

The scientific contributions of this thesis are threefold. First, it provides new insights into the role of multimodality and structural representation-based cues for group-level affective computing, clarifying how group and individual settings diverge in their requirements and challenges.
Second, it advances methodological design through the introduction of two complementary frameworks: a cross-attention fusion model with FAP for temporal aggregation, and VE-MD as a generalizable latent space for multitask learning. Third, it establishes a privacy-preserving paradigm for GER, showing that competitive or state-of-the-art performance can be achieved without relying on individual features as input data.
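To give a rough intuition of the first framework, the sketch below shows cross-modal attention between audio and video embedding sequences, followed by an attention-based pooling over frames. All dimensions, variable names, the residual fusion, and the single-head formulation are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # One modality's tokens attend to the other modality's tokens
    # (single-head scaled dot-product attention, illustrative only).
    scores = queries @ keys_values.T / np.sqrt(d)   # (Tq, Tkv)
    return softmax(scores, axis=-1) @ keys_values   # (Tq, d)

def frames_attention_pooling(frame_feats, w):
    # A scoring vector w (learned in practice) weights each frame;
    # the clip embedding is the attention-weighted sum of frames.
    alpha = softmax(frame_feats @ w)                # (T,)
    return alpha @ frame_feats                      # (d,)

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))    # 8 video frame embeddings (hypothetical)
audio = rng.normal(size=(12, d))   # 12 audio token embeddings (hypothetical)

# Video attends to audio, then the enriched frames are pooled into
# a single clip-level representation for group-level classification.
v2a = cross_attention(video, audio, d)
fused_frames = video + v2a                          # residual fusion
clip_embedding = frames_attention_pooling(fused_frames, rng.normal(size=d))
print(clip_embedding.shape)  # (16,)
```

Note that nothing here operates on individual faces or voices: attention and pooling act on whole-scene embeddings, which is what makes the group-level, privacy-preserving formulation possible.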

Date and place

Friday, January 16, 2026, at 14:00
Auditorium, IMAG Building
And Zoom

Jury members

Reviewers (rapporteurs)
Alessandro VINCIARELLI
Full Professor, University of Glasgow
Antitza DANTCHEVA
Research Director, Inria Centre at Université Côte d'Azur, Sophia Antipolis

Examiners
Christine KERIBIN
Full Professor, Université Paris-Saclay
Bernd DUDZIK
Assistant Professor, Delft University of Technology (TU Delft)
Didier SCHWAB
Full Professor, Université Grenoble Alpes

Thesis supervisors
Dominique VAUFREYDAZ
Full Professor, Université Grenoble Alpes
Frédérique LETUE
Associate Professor, Université Grenoble Alpes

Submitted on January 8, 2026

Updated on January 9, 2026