Thursday, February 27, 2025
What do multimodal tokenizers understand about conversation?
Abstract
Large language models have changed the way we interact with computers through language. Recent developments have introduced versatile multimodal capabilities into these models through adapters that project audio, image, or video representations into the textual token space. These models offer an opportunity for better modeling and understanding of human-human and human-machine multimodal interactions.
In this talk, I will describe an ongoing project on building foundation models of multimodal conversation. Drawing on video generation techniques, the project aims to assess the representational capabilities of models that can continue a human-human conversation in both the audio and video modalities. I will describe a framework for studying these capabilities by probing the underlying representations on various tasks, such as detecting backchannels in dyadic conversations; a minimal sketch of such a probe follows below. The project has applications in studying human conversational behavior, as well as in simulating this behavior in the context of robotic interactions.
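To make the probing idea concrete, here is a minimal sketch of a linear probe for backchannel detection. Everything in it is hypothetical: the synthetic embeddings stand in for frozen representations extracted from a multimodal foundation model, and the labels mark whether a conversational segment contains a backchannel.

```python
# Minimal probing sketch: train a shallow classifier on frozen embeddings
# to test whether they encode backchannel cues. The data is synthetic;
# in practice the embeddings would come from a frozen multimodal model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical dataset: one 768-d embedding per conversational segment,
# labeled 1 if the segment contains a backchannel ("uh-huh", a nod), else 0.
n_segments, dim = 2000, 768
labels = rng.integers(0, 2, size=n_segments)
# Inject a weak label-correlated direction so the probe has signal to find.
signal = rng.normal(size=dim)
embeddings = rng.normal(size=(n_segments, dim)) + 0.5 * np.outer(labels, signal)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# The probe is kept deliberately shallow: high accuracy then reflects
# information already present in the representation, not probe capacity.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"backchannel probe F1: {f1_score(y_test, probe.predict(X_test)):.3f}")
```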
Date and place
Thursday, February 27 at 14:00
GIPSA-lab, room B314
Speaker
Benoit Favre
Laboratoire LIS (Université Aix-Marseille)