
Yoann Dupas

Wednesday, January 14, 2026

Advanced Early Fusion Approaches for Multimodal Image Fusion for Pedestrian and Vehicle Detection

Abstract:

Improving artificial perception, particularly the detection of pedestrians and vehicles, is a significant challenge for advanced driver assistance systems (ADAS). This improvement is crucial for preventing accidents, especially in adverse conditions such as low light, rain, snow, and fog. This thesis addresses these challenges through multimodal fusion of images from complementary sensors (visible cameras, infrared cameras, and LiDAR), while leveraging the robustness of existing monomodal approaches without altering their architecture.

The first contribution introduces the MEFA (Multimodal Early Fusion with Attention) module. By integrating global and local attention, MEFA combines data from the three modalities into a single intermediate image. This intermediate image can be fed directly to state-of-the-art monomodal object detectors, such as YOLO or RT-DETR, without major architectural changes.

To reduce the computational overhead of this initial approach, a second contribution, MEFA-MS, was developed to balance accuracy and latency. MEFA-MS is built on a U-Net-style encoder-decoder architecture that integrates spatial and channel attention mechanisms, significantly reducing inference time while maintaining or improving detection accuracy in complex conditions such as rain, fog, or snow.

Finally, preliminary work explores multi-head cross-modal attention mechanisms (MEFA-CMS) to better manage misalignment between modalities and strengthen feature fusion.

Experiments conducted on the DENSE dataset demonstrate that the modules proposed in this thesis significantly improve detection, particularly in difficult weather conditions, raising accuracy while offering a favorable trade-off between performance and speed.
In short, this thesis combines an innovative early fusion approach with the dynamic use of attention mechanisms to adapt monomodal approaches to a multimodal context. The results provide a path toward robust and scalable perception solutions that meet the requirements of next-generation road safety systems and connected mobility, while laying the groundwork for future work on real-time and energy constraints.
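The MEFA architecture itself is not reproduced here; purely as a rough illustration of the early-fusion idea (the function names and the simple global-attention scheme below are hypothetical simplifications, not the thesis design), blending three modalities into a detector-ready three-channel intermediate image might be sketched as:

```python
import numpy as np

def channel_attention_weights(features):
    """Toy global attention: average-pool each modality to a scalar,
    then softmax into a per-modality blending weight."""
    pooled = np.array([f.mean() for f in features])
    e = np.exp(pooled - pooled.max())
    return e / e.sum()

def early_fuse(visible, infrared, depth):
    """Illustrative early fusion: weight each modality by a global
    attention score and blend into a 3-channel 'intermediate image'
    that an unmodified RGB detector could consume."""
    # Broadcast the single-channel modalities to 3 channels
    ir3 = np.repeat(infrared, 3, axis=-1)
    d3 = np.repeat(depth, 3, axis=-1)
    w = channel_attention_weights([visible, ir3, d3])
    fused = w[0] * visible + w[1] * ir3 + w[2] * d3
    return np.clip(fused, 0.0, 1.0)

H, W = 4, 4
vis = np.random.rand(H, W, 3)   # visible camera, 3 channels
ir = np.random.rand(H, W, 1)    # infrared intensity
dep = np.random.rand(H, W, 1)   # LiDAR depth projected to the image plane
out = early_fuse(vis, ir, dep)
print(out.shape)  # (4, 4, 3)
```

The key design point this sketch mirrors is that the fusion output keeps the shape of an ordinary RGB image, so a monomodal detector can be reused without architectural changes; the thesis replaces the scalar weights with learned global and local attention.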

Keywords: Object detection, Multimodal fusion, Deep Learning, Convolutional neural network, Attention, Multi-scale, Transformer, ADAS

Date and place

Wednesday, January 14 at 14:00
IMAG building, in the amphitheater on the ground floor

Jury members

Thesis supervision:
Denis Trystram
Thesis supervisor, Professor, Grenoble INP - UGA
Christophe Cérin
Thesis co-supervisor, Professor, Université Sorbonne Paris Nord
Olivier Hotel
Industrial co-supervisor, Researcher, Orange
Grégoire Lefebvre
Industrial co-supervisor, Researcher, Orange
 
Thesis committee:
Denis Trystram
Thesis supervisor, Professor, Grenoble INP - UGA
Christophe Cérin
Thesis co-supervisor, Professor, Université Sorbonne Paris Nord
Sylvie Chambon
Reviewer, Professor, Toulouse INP
Sidi-Mohammed Senouci
Reviewer, Professor, Université Bourgogne Europe
Pia Bideau
Examiner, Research Scientist, Centre INRIA Université Grenoble Alpes
Maria Malek
Examiner, Associate Professor, Université de Cergy Pontoise
Laure Tougne
Examiner, Professor, Université Lumière Lyon 2
Olivier Aycard
Examiner, Professor, Grenoble INP - UGA
Grégoire Lefebvre
Industrial co-supervisor, Researcher, Orange
Olivier Hotel
Industrial co-supervisor, Researcher, Orange

Submitted on January 8, 2026

Updated on January 8, 2026