Friday, January 16, 2026
Multimodal Group Emotion Recognition In-the-Wild: Towards a Privacy-Safe Non-Individual Approach
This thesis addresses the challenge of group emotion recognition (GER) in-the-wild. Traditional approaches to emotion recognition often rely on individual-level cues such as facial recognition, gaze tracking, or voice profiling. While effective in some settings, these methods raise serious concerns about privacy and surveillance. To overcome these limitations, this thesis prioritizes privacy preservation by leveraging only collective audio–visual signals, focusing on group-level rather than individual-level emotion recognition. The overall objective is to develop multimodal models that can infer group emotions while avoiding the risks associated with individual monitoring and surveillance. Two complementary frameworks are proposed to achieve this goal. The first introduces a cross-attention multimodal architecture for audio–video fusion, combined with a Frames Attention Pooling (FAP) strategy. This framework is further supported by synthetic data augmentation and validated through extensive ablation studies. These experiments demonstrate his effectiveness and robustness for GER in real-world conditions. The second, the Variational Encoder Multi-Decoder (VE-MD), introduces a shared latent space jointly optimized for emotion classification, body, and face structural representation prediction. Two structural representation decoding strategies are explored: DETR-based and heatmap-based, highlighting their respective strengths and limitations in group versus individual settings. A detailed analysis reveals how structural representation integration impacts GER differently compared to non-GER.The scientific contributions of this thesis are threefold. First, it provides new insights into the role of multimodality and structural representation-based cues for group-level affective computing, clarifying how group and individual settings diverge in their requirements and challenges. 
Second, it advances methodological design through the introduction of two complementary frameworks: a cross-attention fusion model with FAP for temporal aggregation, and VE-MD as a generalizable latent space for multitask learning. Third, it establishes a privacy-preserving paradigm for GER, showing that competitive or state-of-the-art performance can be achieved without relying on individual features as input data.
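The abstract names two mechanisms in the first framework, cross-modal attention for audio–video fusion and attention-based pooling over frames, without detailing them. As a purely illustrative aid, a minimal NumPy sketch of these two generic mechanisms might look as follows; all function names, weights, and dimensions here are hypothetical assumptions for illustration, not the thesis implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one modality queries the other.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (T_q, T_k) affinities
    return softmax(scores, axis=-1) @ values  # (T_q, d) fused features

def frames_attention_pooling(frame_feats, w):
    # Score each frame, softmax over time, return the weighted sum:
    # a single clip-level vector instead of uniform average pooling.
    scores = softmax(frame_feats @ w)         # (T,) frame weights
    return scores @ frame_feats               # (d,) pooled embedding

rng = np.random.default_rng(0)
T_v, T_a, d = 16, 20, 32                      # video frames, audio steps, feature dim
video = rng.standard_normal((T_v, d))         # per-frame visual features
audio = rng.standard_normal((T_a, d))         # per-step audio features

fused = cross_attention(audio, video, video)  # audio attends to video: (T_a, d)
clip_vec = frames_attention_pooling(fused, rng.standard_normal(d))
print(clip_vec.shape)                         # one vector per clip: (32,)
```

In a trained model the pooling weights `w` and the attention projections would be learned parameters, and attention would typically run in both directions (audio-to-video and video-to-audio) before classification; the sketch only shows the data flow.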
Date and place
Friday, January 16, 2026 at 14:00
Auditorium, IMAG Building
and on Zoom
Jury members
Reviewers (rapporteurs)
Alessandro VINCIARELLI
Full Professor, University of Glasgow
Antitza DANTCHEVA
Research Director, Inria Centre at Université Côte d'Azur, Sophia Antipolis
Examiners
Christine KERIBIN
Full Professor, Université Paris-Saclay
Bernd DUDZIK
Assistant Professor, Delft University of Technology (TU Delft)
Didier SCHWAB
Full Professor, Université Grenoble Alpes
Thesis supervisors
Dominique VAUFREYDAZ
Full Professor, Université Grenoble Alpes
Frédérique LETUE
Associate Professor (Maîtresse de Conférences), Université Grenoble Alpes