
Brooke Stephenson

Beyond words: leveraging language models for incremental and context-aware text-to-speech synthesis

Tuesday, 26 September 2023

Abstract:
 
Text-to-speech (TTS) technology has the potential to enable real-time communication for applications such as automatic interpreters or assistive technologies for the speech-impaired. However, current TTS models are not optimized for such use cases because they require full-sentence inputs, leading to delays between conversation turns. Furthermore, these models are unaware of the surrounding context and are thus unable to adapt their prosody to suit the current situation. These limitations impede engagement and understanding. In this thesis, we aim to improve the suitability of TTS for interactive applications by addressing two main challenges. Firstly, we focus on reducing the time required to initiate speech synthesis while maintaining natural prosody. Secondly, we explore the prediction of appropriate prosodic features for a given linguistic context. Language models (LMs), known for their effectiveness in natural language processing tasks, are employed as the primary tool for investigating both challenges.
 
We begin by investigating the importance of degrees of lookahead (i.e., future words) for a vanilla, full-sentence TTS model. We do this by measuring the distance between the final internal representation of a word (i.e., when the full sentence is known) and the intermediate representations at each degree of lookahead. We also compare the prosodic quality of the outputs with a subjective test. Finally, we use random forest analysis to study which factors contribute the most to the stability of the internal representations (i.e., to determine whether the representation is likely to change or not). These tests show that word representations are shaped mostly by the next two words of lookahead and that word length is the largest predictor of stability.
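The stability measurement described above can be illustrated with a small sketch. The vectors below are toy stand-ins (not actual encoder states from the thesis): each one represents the same word's internal representation under 0–3 words of lookahead, and we measure its cosine distance to the final, full-sentence representation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Final representation of a word once the whole sentence is known
# (toy 3-dimensional example, not real TTS encoder output).
final = [0.4, 0.9, 0.1]

# Intermediate representations of the same word at increasing lookahead.
intermediate = {
    0: [0.90, 0.20, 0.30],
    1: [0.60, 0.70, 0.20],
    2: [0.45, 0.85, 0.12],
    3: [0.41, 0.89, 0.10],
}

# Distance to the final representation at each degree of lookahead;
# in this toy setup most of the convergence happens within two words,
# mirroring the finding reported above.
distances = {k: cosine_distance(v, final) for k, v in intermediate.items()}
```

The per-lookahead distances computed this way can then feed the kind of stability analysis (e.g., random forests over word features) mentioned in the abstract.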
 
We then investigate the use of pseudo-future text (generated by a language model) to enhance incremental text-to-speech (iTTS) synthesis. By leveraging linguistic clues present in the already provided text, language models anticipate the future context, filling in missing information for prosody modelling purposes. Objective and perceptual evaluations show that this approach offers a good compromise between responsiveness and naturalness of synthesis, but remains highly dependent on the quality of text prediction.
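The pseudo-lookahead idea can be sketched minimally as follows. A trivial bigram table stands in for the neural LM used in the thesis: given only the words received so far, it predicts k future words so the synthesizer can model prosody as if it had real lookahead, discarding the prediction once the actual next word arrives.

```python
# Toy stand-in for an LM's next-word prediction (a real system would
# use a neural language model here, not a hand-written table).
BIGRAMS = {
    "the": "weather",
    "weather": "is",
    "is": "nice",
    "nice": "today",
}

def pseudo_lookahead(prefix_words, k):
    """Extend the known prefix with up to k predicted words."""
    words = list(prefix_words)
    for _ in range(k):
        nxt = BIGRAMS.get(words[-1])
        if nxt is None:  # no prediction available: stop early
            break
        words.append(nxt)
    return words

# With only "the weather" received, the iTTS front end sees two extra
# pseudo-future words for prosody computation.
extended = pseudo_lookahead(["the", "weather"], 2)
```

As the abstract notes, the benefit of this scheme hinges on prediction quality: wrong pseudo-future words can push prosody in the wrong direction.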
 
Finally, we address the challenge of producing contextually appropriate speech. We identify an aspect of prosody modelling, contrastive focus on personal pronouns, which can be particularly challenging due to the high-level discursive knowledge often required for correct prediction. We evaluate the contribution pretrained LMs can make to this task compared to less linguistically sophisticated baselines. We also compare prediction accuracy with different amounts of context and test the control of prominence in the speech output. We go on to evaluate the use of LMs to guide speech segmentation for high input latency applications. We compare LM-informed methods with simpler count-based methods using subjective tests and a sentence verification test.
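For concreteness, the count-based segmentation baseline mentioned above can be sketched in a few lines (a hypothetical illustration, not the thesis code): the incoming word stream is simply cut into chunks of at most n words, with no linguistic knowledge involved, which is what the LM-informed methods are compared against.

```python
def count_based_segments(words, n):
    """Split a word stream into consecutive chunks of at most n words."""
    return [words[i:i + n] for i in range(0, len(words), n)]

# A 7-word utterance segmented into chunks of 3 for deferred synthesis.
chunks = count_based_segments(
    "this is a longer utterance to segment".split(), 3
)
# → [['this', 'is', 'a'], ['longer', 'utterance', 'to'], ['segment']]
```

An LM-informed segmenter would instead place boundaries where the model judges a prosodic or syntactic break to be natural, at the cost of running the LM in the loop.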

Date and Venue

Tuesday, 26 September 2023, at 2:30 p.m.
Amphithéâtre, Maison Jean Kuntzmann
and on Zoom

Jury Composition

Damien LOLIVE
Professor, Université de Rennes 1, Reviewer
Mireia FARRUS CABECERAN
Associate Professor, Universitat de Barcelona, Reviewer
Joakim GUSTAFSON
Professor, KTH Royal Institute of Technology, Examiner
Olivier KRAIF
Professor, Université Grenoble Alpes, Jury President
Thomas HUEBER
CNRS Research Director, Thesis Supervisor
Laurent BESACIER
Senior Researcher, NAVER LABS Europe, Thesis Co-supervisor

Published 11 September 2023

Updated 8 October 2024