Thursday, February 27, 2025
What do multimodal tokenizers understand about conversation?
Abstract
Large language models have changed the way we interact with computers through language. Recent developments have introduced versatile multimodal capabilities into these models through adapters that project audio, image, or video representations into the textual token space. These models offer an opportunity for better modeling and understanding of human-human and human-machine multimodal interactions.
In this talk, I will describe an ongoing project on building foundation models of multimodal conversation. Drawing on video generation techniques, the project aims to assess the representational capabilities of models that can continue a human-human conversation in both the audio and video modalities. I will describe a framework for studying these capabilities by probing the underlying representations on various tasks, such as detecting backchannels in dyadic conversations; a minimal sketch of such a probe follows below. The project has applications in studying human conversational behavior, as well as in simulating this behavior in the context of robotic interactions.
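To make the probing idea concrete, here is a minimal sketch of a linear probe for backchannel detection. Everything in it is hypothetical: the synthetic embeddings stand in for frozen representations extracted from a multimodal foundation model, and the labels mark whether a conversational segment contains a backchannel.

```python
# Minimal probing sketch: train a shallow classifier on frozen embeddings
# to test whether they encode backchannel cues. The data is synthetic;
# in practice the embeddings would come from a frozen multimodal model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical dataset: one 768-d embedding per conversational segment,
# labeled 1 if the segment contains a backchannel ("uh-huh", a nod), else 0.
n_segments, dim = 2000, 768
labels = rng.integers(0, 2, size=n_segments)
# Inject a weak label-correlated direction so the probe has signal to find.
signal = rng.normal(size=dim)
embeddings = rng.normal(size=(n_segments, dim)) + 0.5 * np.outer(labels, signal)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# The probe is kept deliberately shallow: high accuracy then reflects
# information already present in the representation, not probe capacity.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"backchannel probe F1: {f1_score(y_test, probe.predict(X_test)):.3f}")
```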
Date and place
Thursday, February 27 at 14:00
GIPSA-lab, room B314
Speaker
Benoit Favre
Laboratoire LIS (Université Aix-Marseille)