Thursday, December 18, 2025
Towards Simpler Transcripts: Investigating Automatic Simplification of French Spontaneous Speech
Abstract:
Understanding the world relies on the human ability to abstract and simplify complex phenomena. In the context of natural language, simplification is essential to enable comprehension, particularly when forms of expression are perceived as complex. Prior literature often associates linguistic complexity with features typical of formal written texts. However, such complexity can also arise from spontaneous speech, which is often structurally irregular and characterized by the presence of disfluencies such as hesitations, repetitions or false starts.
In this thesis, we address the automation of speech simplification with spontaneous French as the input language, a task that remains largely unexplored. While automatic text simplification (ATS) has advanced significantly for English written texts, little attention has been paid to the spoken modality. Furthermore, the shortage of parallel corpora for languages less resource-rich than English has been an obstacle to automating the task for French. The main objective of this work is therefore to bridge these gaps by proposing an artifact that automatically simplifies spontaneous French speech.
The thesis is structured around three main goals. First, we propose a characterization of simplification strategies specific to spontaneous French, as the task of spontaneous speech simplification has not been formally defined. To this end, we collect expert- and machine-based simplifications of utterances drawn from the CEFC dataset and analyze the linguistic operations performed. The findings show a preference for deletions and a tendency to produce register-standardized sentences that retain only the propositional content of the input.
Second, we tackle the challenge of limited task-specific parallel data through two data creation methods. The first method exploits register-differentiated comparable corpora (i.e., Wikipedia and Vikidia) to extract aligned complex-simpler sentence pairs, resulting in the WiViCo set, which contains 46k pairs that are not spontaneous speech but are human-written. The second method is based on synthetic data generation using LLMs through an iterative exo-refinement workflow. In this approach, separate LLMs are used for generation and evaluation, enabling external feedback loops and role specialization. Applied to CEFC transcripts, this process produces the CEFC-Synth dataset, which, despite being artificially generated, more closely reflects the spontaneous speech modality.
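The generator/evaluator separation described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function `exo_refine`, the `Feedback` type, the `max_rounds` parameter, and the rule-based stand-ins for the two LLMs are all hypothetical names introduced here for illustration. The sketch only shows the control flow of an external feedback loop, in which one component proposes a simplification and a separate component accepts it or returns a comment for the next round.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Verdict returned by the evaluator (hypothetical structure)."""
    accepted: bool
    comment: str = ""


# Assumed signatures: in practice each role would wrap a different LLM.
Generator = Callable[[str, Optional[str], Optional[str]], str]
Evaluator = Callable[[str, str], Feedback]


def exo_refine(utterance: str, generate: Generator, evaluate: Evaluator,
               max_rounds: int = 3) -> Tuple[str, bool]:
    """Refine a candidate until the external evaluator accepts it.

    Returns the last candidate and whether it was accepted.
    """
    candidate = generate(utterance, None, None)
    for _ in range(max_rounds):
        feedback = evaluate(utterance, candidate)
        if feedback.accepted:
            return candidate, True
        # The evaluator's comment is fed back to the generator.
        candidate = generate(utterance, candidate, feedback.comment)
    return candidate, False


# Toy rule-based stand-ins so the sketch runs without any model.
FILLERS = {"euh", "ben", "hein"}


def toy_generator(utterance: str, previous: Optional[str],
                  comment: Optional[str]) -> str:
    text = previous if previous is not None else utterance
    words = [w for w in text.split() if w.lower() not in FILLERS]
    if comment:
        # On feedback, also collapse immediately repeated words.
        deduped = []
        for w in words:
            if not deduped or deduped[-1].lower() != w.lower():
                deduped.append(w)
        words = deduped
    return " ".join(words)


def toy_evaluator(source: str, candidate: str) -> Feedback:
    words = candidate.lower().split()
    repeated = any(a == b for a, b in zip(words, words[1:]))
    return Feedback(not repeated, "remove repeated words" if repeated else "")


if __name__ == "__main__":
    print(exo_refine("euh je je voulais euh partir", toy_generator, toy_evaluator))
```

With the toy components, the first candidate still contains the repetition "je je", so the evaluator rejects it and the second round produces "je voulais partir". Keeping the two roles behind separate callables is what lets a different model be plugged into each one.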
Finally, building on these resources, we introduce an artifact for French spontaneous speech simplification, trained on a combination of human-written (i.e., WiViCo) and LLM-generated (i.e., CEFC-Synth) data. We experiment with both cascade and end-to-end architectures and evaluate them with automatic metrics against a CEFC-based test set of expert-crafted reference simplifications. The results show that the proposed models notably outperform MUSS, a state-of-the-art text simplification system. Moreover, including synthetic speech-domain data in the training set proves beneficial, as evidenced by the results of the transcript-to-simplification experiments.
By providing new evaluation and training resources, different methods for task-specific data creation, and an operational artifact, this thesis advances the generation of simpler transcripts from spontaneous French. This has implications not only for accessibility (by making the message clearer for diverse target audiences), but also from a computational standpoint, as intermediate simplified representations may improve performance in other downstream NLP tasks.
Date and place
Thursday, December 18 at 10:00
Room 6050 (6th floor) at Uni Mail (Université de Genève)
And on Zoom
Jury members
Supervision:
Pierrette BOUILLON
Full Professor, Université de Genève
Benjamin LECOUTEUX
Full Professor, Université Grenoble Alpes
Didier SCHWAB
Full Professor, Université Grenoble Alpes
Jury members:
Yannick ESTÈVE
Full Professor, Avignon Université
Núria GALA
Full Professor, Aix Marseille Université
Annarita FELICI
Associate Professor, Université de Genève