[ABSTRACT]
The introduction of speech translation corpora, in which speech signals are aligned with their corresponding translation texts, together with the steady growth in computational capacity, has played a crucial role in making the training of neural end-to-end speech-to-text translation feasible. This thesis explores neural approaches to end-to-end speech-to-text translation, hereafter referred to as Automatic Speech Translation, focusing on two types of end-to-end translation systems: (1) Offline Speech Translation and (2) Online Speech Translation.
 
With respect to offline speech translation, we build strong end-to-end baselines for two language pairs, English-to-Portuguese and English-to-German. They are based on VGG-like Convolutional Neural Network blocks coupled with Long Short-Term Memory layers on the encoder side and a stack of Long Short-Term Memory layers on the decoder side. We investigate different data augmentation techniques as well as different target token units (characters and Byte Pair Encoding vocabularies of different sizes), and validate these baselines through our participation in international shared tasks on speech translation. In addition, we compare speech representations obtained through Self-Supervised Learning, in particular those from a pre-trained English wav2vec model, with conventional speech representations, including Mel filter-bank and MFCC features, on the speech translation task, specifically in low- and medium-resource scenarios with less than 100 hours of training data. Our analyses suggest that wav2vec features might be better at discriminating phones, better at aligning source and target sequences, and more robust to speaker variability. Finally, we train our own Self-Supervised Learning models on a large amount of unlabelled French speech data. These models are shown to be effective for a wide range of speech tasks included in LeBenchmark, a reproducible framework for assessing self-supervised representation learning from speech.
 
As regards online speech translation, we adapt the wait-k policy, originally proposed for online text-to-text translation, to the speech translation task, and advocate using Unidirectional instead of Bidirectional Long Short-Term Memory speech encoders for online speech translation. We propose a new encoding strategy named Unidirectional Long Short-Term Memory Overlap-and-Compensate, which allows Unidirectional Long Short-Term Memory-based speech encoders to work more effectively in online speech translation. We first evaluate both the decoding and the encoding strategies on their ability to leverage pre-trained offline end-to-end speech translation models for the online translation task. We then propose fine-tuning these pre-trained models in a training mode better suited to online translation, to further boost the performance of the online translation systems. Other aspects of online speech translation are also investigated, for instance the impact of input speech segmentation, the impact of output granularity, and different fine-tuning scenarios.
 
Keywords: End-to-end automatic speech translation, neural machine translation, low-latency speech translation, self-supervised learning for speech translation.