Hang Le

Monday, March 25th, 2024

Model Architectures and Training Techniques for Multilingual Speech-to-Text Translation

Abstract:
 
Speech-to-text translation (ST) consists of translating a speech audio input in one language into a text output in another language. This task is highly challenging due to its multimodal and multilingual nature: it involves both the speech and text modalities, and more than one language. In this thesis, we make three major contributions spanning two primary research areas of ST, namely model architectures and training techniques.
 
First, in terms of model architectures, we introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual ST. Our model consists of two decoders, each responsible for one task (ASR or ST), that can interact with each other through a novel dual-attention mechanism. This design allows the decoders to specialize in their respective tasks while helping each other. We propose two variants, called the parallel and cross dual-decoder Transformers, corresponding to two different levels of dependency between the decoders. The proposed model also generalizes existing approaches that use two independent or weakly tied decoders. Experiments on standard benchmarks show that our models outperform previous work in terms of translation performance under both bilingual and multilingual settings.
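To make the dual-attention idea more concrete, here is a minimal PyTorch sketch of a single decoder layer in which one decoder additionally attends to the other decoder's hidden states. The dimensions, module names, and the exact placement of the dual-attention sub-layer are illustrative assumptions rather than the thesis implementation; causal and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class DualAttentionDecoderLayer(nn.Module):
    """One decoder layer that also attends to the *other* decoder's hidden
    states through an extra cross-attention (the dual attention)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x, enc_out, other_dec_out):
        # Self-attention (causal mask omitted in this sketch).
        x = self.norms[0](x + self.self_attn(x, x, x)[0])
        # Cross-attention over the shared speech encoder output.
        x = self.norms[1](x + self.enc_attn(x, enc_out, enc_out)[0])
        # Dual attention: queries from this decoder, keys/values from the other decoder.
        x = self.norms[2](x + self.dual_attn(x, other_dec_out, other_dec_out)[0])
        return self.norms[3](x + self.ffn(x))

# Toy usage: ASR and ST decoder states interacting over a shared speech encoder output.
B, T_enc, T_dec, d = 2, 50, 10, 256
enc_out = torch.randn(B, T_enc, d)        # speech encoder output
asr_states = torch.randn(B, T_dec, d)     # ASR decoder hidden states
st_states = torch.randn(B, T_dec, d)      # ST decoder hidden states
asr_layer, st_layer = DualAttentionDecoderLayer(), DualAttentionDecoderLayer()
new_asr = asr_layer(asr_states, enc_out, st_states)
new_st = st_layer(st_states, enc_out, asr_states)
print(new_asr.shape, new_st.shape)        # torch.Size([2, 10, 256]) for both
```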
 
Second, in terms of training techniques, we propose a parameter-efficient fine-tuning approach based on adapter modules. We show that language-specific adapters can enable a fully trained multilingual ST model to be further specialized for each language pair. With these adapter modules, one can efficiently obtain a single multilingual ST system that outperforms the original multilingual model as well as multiple bilingual systems while maintaining a low storage cost and simplicity in deployment. In addition, we show that adapters can also be used to connect available pre-trained models such as an ASR model and a multilingual denoising auto-encoder to form strong multilingual ST systems.
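As an illustration of the adapter idea, the following hypothetical sketch wraps a frozen backbone layer with one small bottleneck adapter per target language, so that only the adapter parameters would be fine-tuned. The bottleneck size, placement, and language routing are assumptions made for the example, not the thesis configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, d_model=256, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

class AdaptedLayer(nn.Module):
    """Wraps a frozen backbone layer with one adapter per target language."""
    def __init__(self, backbone_layer, languages, d_model=256):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():
            p.requires_grad = False                  # backbone stays frozen
        self.adapters = nn.ModuleDict({lang: Adapter(d_model) for lang in languages})

    def forward(self, x, lang):
        # Route through the adapter of the requested target language.
        return self.adapters[lang](self.backbone(x))

# Toy usage: a frozen feed-forward "layer" specialized for French and German.
backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
layer = AdaptedLayer(backbone, languages=["fr", "de"])
x = torch.randn(2, 10, 256)
print(layer(x, "fr").shape)   # torch.Size([2, 10, 256]); only adapter weights are trainable
```

In this toy setup, each language pair only adds the small down- and up-projection matrices on top of the shared frozen backbone, which is what keeps the per-language storage cost low compared to maintaining separate bilingual models.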
 
Finally, as a second contribution in training techniques, we propose an effective supervised pre-training method to address the so-called speech-text modality gap, a well-known major challenge in ST. Our method combines the connectionist temporal classification (CTC) loss and optimal transport in a Siamese-like model. This model is composed of two encoders, one for acoustic inputs and the other for textual inputs, which are trained such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on standard benchmarks show that our pre-training method, applied to the vanilla encoder-decoder Transformer, achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Moreover, our method can also be applied on top of these multi-task systems, leading to further improvements.
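The sketch below illustrates, under simplifying assumptions, how a CTC loss on the speech branch can be combined with an optimal-transport loss that pulls speech and text encoder outputs towards each other. The entropic-regularized Sinkhorn routine, the toy dimensions, and the plain sum of the two losses are illustrative choices, not the exact formulation used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn_wasserstein(x, y, eps=0.1, n_iters=50):
    """Entropic-regularized OT cost between two sequences of vectors with uniform
    marginals; a simplified stand-in for the Wasserstein loss described above."""
    cost = torch.cdist(x, y, p=2)                    # pairwise transport costs
    cost = cost / (cost.max() + 1e-8)                # rescale so the kernel stays well-conditioned
    a = torch.full((x.size(0),), 1.0 / x.size(0))    # uniform source marginal
    b = torch.full((y.size(0),), 1.0 / y.size(0))    # uniform target marginal
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):                         # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u + 1e-8)
        u = a / (K @ v + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan
    return (plan * cost).sum()

# Siamese encoders: one for speech features, one for embedded text tokens.
d_model, vocab = 256, 1000
speech_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
text_enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
text_emb = nn.Embedding(vocab, d_model)
ctc_head = nn.Linear(d_model, vocab)                 # projects speech states onto the vocabulary
ctc_loss = nn.CTCLoss(blank=0)

# Toy batch: one utterance of 80 speech frames and its 12-token transcript.
speech = torch.randn(1, 80, d_model)
tokens = torch.randint(1, vocab, (1, 12))

h_speech = speech_enc(speech)
h_text = text_enc(text_emb(tokens))

# CTC loss on the speech branch; nn.CTCLoss expects log-probs of shape (T, B, vocab).
log_probs = F.log_softmax(ctc_head(h_speech), dim=-1).transpose(0, 1)
loss_ctc = ctc_loss(log_probs, tokens, torch.tensor([80]), torch.tensor([12]))

# OT loss pulls speech and text representations towards each other.
loss_ot = sinkhorn_wasserstein(h_speech[0], h_text[0])
(loss_ctc + loss_ot).backward()
print(float(loss_ctc), float(loss_ot))
```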

Date and place

Monday, March 25th, 2024, 14:00
IMAG Building
and Zoom

Jury members

Didier Schwab
Université Grenoble Alpes, Advisor 
Benjamin Lecouteux
Université Grenoble Alpes, Co-advisor 
Frédéric Béchet
Aix-Marseille Université, CNRS, LIS, Reviewer 
François Yvon
Sorbonne Université, CNRS, ISIR, Reviewer 
Juan Pino
Meta AI, Examiner
Laurent Besacier
NAVER LABS Europe, Examiner 
Caroline Rossi
Université Grenoble Alpes, Examiner
