Cécile Macaire

Tuesday, March 11, 2025

Automatic speech translation into pictograms

Abstract:
Augmentative and Alternative Communication (AAC) provides methods and tools to address impairments in speech production and comprehension. Pictograms, key elements of AAC, facilitate the communication of thoughts and emotions through simplified iconography. However, misconceptions and economic barriers hinder the widespread adoption of AAC, highlighting the need for tailored solutions. Automatic speech-to-pictogram translation, a new task in Natural Language Processing (NLP), aims to generate pictogram sequences from spoken utterances. At the intersection of AAC and speech-to-text translation (ST), this task can facilitate communication between caregivers (medical staff, family members) and individuals with language disorders. It nevertheless faces major challenges: a lack of unified multimodal data, the absence of a precise evaluation framework, and the need for specialized neural models for pictogram translation.
In this thesis, we present three contributions to address these challenges. We introduce two methods for creating multimodal corpora that align speech, text, and pictograms. The first method combines a grammar with a restricted vocabulary to generate a pictogram sequence from a transcription, while the second uses a processing pipeline to retrieve audio for texts already translated into pictograms. Together, these methods yield robust datasets for model training and evaluation.
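For illustration only, the first method can be pictured as a lemmatize-and-look-up step over a restricted vocabulary: tokens the grammar does not cover are dropped, and the remaining lemmas are mapped to pictogram identifiers. The Python sketch below uses a toy lexicon and invented pictogram IDs, not the actual grammar or resources from the thesis.

```python
# Hypothetical sketch of a rule-based text-to-pictogram step.
# The lexicon, lemma table, and pictogram IDs are all invented
# for illustration; they are not the thesis's resources.

PICTO_LEXICON = {
    "je": 101,       # "I"
    "vouloir": 102,  # "to want"
    "boire": 103,    # "to drink"
    "eau": 104,      # "water"
}

# Toy lemma table standing in for a real French lemmatizer;
# None marks function words outside the restricted vocabulary.
LEMMAS = {"veux": "vouloir", "de": None, "l'eau": "eau"}

def text_to_pictograms(transcription: str) -> list[int]:
    """Map a transcription to a sequence of pictogram IDs."""
    pictos = []
    for token in transcription.lower().split():
        lemma = LEMMAS.get(token, token)
        if lemma is None:           # drop uncovered function words
            continue
        if lemma in PICTO_LEXICON:  # restricted-vocabulary lookup
            pictos.append(PICTO_LEXICON[lemma])
    return pictos

print(text_to_pictograms("Je veux boire de l'eau"))  # [101, 102, 103, 104]
```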
In our second contribution, we define a specialized evaluation framework combining automatic and human evaluation. We adapt metrics commonly used in Automatic Speech Recognition (ASR) and Machine Translation (MT) to compare model performance effectively. Additionally, we apply an analytical framework to interpret translation quality.
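The announcement does not say which metrics were adapted, but one natural adaptation is to treat each pictogram identifier as a token, so that WER becomes a pictogram error rate and BLEU measures n-gram overlap between ID sequences. The sketch below applies the jiwer and sacrebleu libraries to invented sequences under that assumption.

```python
# Assumed adaptation of ASR/MT metrics to pictogram sequences:
# serialize each ID sequence as a space-separated string so that
# token-level WER and BLEU apply directly. Data is invented.

import jiwer
import sacrebleu

references = [[101, 102, 103, 104]]  # gold pictogram IDs, one utterance
hypotheses = [[101, 103, 104]]       # predicted pictogram IDs

refs = [" ".join(map(str, seq)) for seq in references]
hyps = [" ".join(map(str, seq)) for seq in hypotheses]

picto_error_rate = jiwer.wer(refs, hyps)    # WER over pictogram tokens
bleu = sacrebleu.corpus_bleu(hyps, [refs])  # BLEU over pictogram tokens

print(f"Pictogram error rate: {picto_error_rate:.2f}")  # 0.25
print(f"BLEU: {bleu.score:.1f}")
```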
Finally, in our third contribution, we investigate two approaches, cascade and end-to-end, for generating pictogram sequences from speech. We compare state-of-the-art ASR, MT, and ST models, trained or fine-tuned on the multimodal data we created. Our evaluation shows that cascade models produce intelligible pictogram translations from read speech in everyday situations. We also achieve competitive results with an end-to-end model on spontaneous speech, an ongoing challenge in NLP. The code, data, and models developed are freely available.
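As a rough sketch of the cascade approach, speech is first transcribed by an ASR model, and the transcription is then translated into pictograms by a second stage. The checkpoint below ("openai/whisper-small") is a public example model, not necessarily one used in the thesis, and the second stage reuses the hypothetical text_to_pictograms() from the earlier sketch; a fine-tuned MT model could take its place.

```python
# Minimal cascade sketch: off-the-shelf ASR, then text-to-pictogram
# translation. Model choice and second stage are illustrative.

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def speech_to_pictograms(audio_path: str) -> list[int]:
    """Cascade: transcribe speech, then translate text into pictograms."""
    transcription = asr(audio_path)["text"]
    return text_to_pictograms(transcription)  # second stage (see above)
```

An end-to-end model would instead map the speech signal directly to pictogram tokens with a single network, avoiding error propagation between the two stages.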

Date and venue

Tuesday, March 11, 2025, at 2:00 PM 
Maison Jean Kuntzmann (MJK), amphitheater

Jury composition

Benjamin LECOUTEUX
Full Professor, Université Grenoble Alpes, Thesis supervisor
Iris ESHKOL-TARAVELLA
Full Professor, Université Paris 10 - Nanterre, Reviewer
Frédéric BÉCHET
Full Professor, Aix-Marseille Université, Reviewer
Didier SCHWAB
Full Professor, Université Grenoble Alpes, Thesis co-supervisor
Nathalie CAMELIN
Associate Professor, Avignon Université, Examiner
François PORTET
Full Professor, Université Grenoble Alpes, Examiner
