
Sotheara Leang

Monday, December 8, 2025

Toward Robust Representation for Low-resource Automatic Speech Recognition

Abstract:

Human speech exhibits high variability due to individual physiological differences in speech production. Most acoustic features rely on absolute frequency measurements and therefore inherently capture speaker information; they are not speaker-independent. To cope with this variability, automatic speech recognition (ASR) systems require extensive annotated training data, which is often scarce and costly to obtain, particularly for low-resource languages. This research aims to develop an intrinsically speaker-gender-independent speech representation that reduces the amount of training data required for ASR. Our first method incorporates dynamic features, motivated by the dynamics of speech production and perception, to characterize acoustic transitions within the Spectral Subband Centroid Frequency (SSCF) space. We propose polar parameters to efficiently capture transitions on the SSCF1-SSCF2 plane, and combine these dynamic features with conventional Mel-Frequency Cepstral Coefficients (MFCCs) to provide a richer representation. To further increase robustness against acoustic variation, we introduce polar-ratio features computed on the ratio plane of the studied SSCFs, and to address tonal speech, we propose SSCF0 as a pseudo-F0 feature. Our study indicates that this method improves recognition performance and enhances speaker-gender independence, specifically for French. As the first method shows limited generalization to Khmer and Vietnamese, our second method learns a compact speech representation that primarily captures linguistic content while minimizing speaker-related information. We propose a factorized vector quantization autoencoder, along with auxiliary supervision tasks, to disentangle speech into content, speaker, and residual representations. Our study shows that the content embeddings effectively capture detailed linguistic information, yielding intelligible speech. In speech recognition, these embeddings significantly outperform the baseline and dynamic features, and exhibit greater robustness to speaker gender across French, Vietnamese, and Khmer in low-resource scenarios.
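The polar parameters mentioned in the abstract describe frame-to-frame acoustic transitions on the SSCF1-SSCF2 plane by their magnitude and direction. The thesis's exact formulation may differ; the following is a minimal sketch, assuming the input is simply two per-frame centroid-frequency trajectories and that the transition vector between consecutive frames is converted to polar coordinates:

```python
import numpy as np

def sscf_polar_dynamics(sscf1, sscf2):
    """Sketch: polar parameters of frame-to-frame transitions
    on the SSCF1-SSCF2 plane (hypothetical formulation).

    sscf1, sscf2: 1-D arrays of subband centroid frequencies (Hz),
                  one value per analysis frame.
    Returns (radius, angle) arrays of length len(sscf1) - 1.
    """
    d1 = np.diff(np.asarray(sscf1, dtype=float))  # transition along SSCF1
    d2 = np.diff(np.asarray(sscf2, dtype=float))  # transition along SSCF2
    radius = np.hypot(d1, d2)   # magnitude of the acoustic transition
    angle = np.arctan2(d2, d1)  # direction of the transition (radians)
    return radius, angle
```

The intuition is that the direction of a transition on this plane is less tied to a speaker's absolute formant range than the raw frequencies themselves, which is why such dynamics can be more speaker-independent.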
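The second method's factorized vector quantization autoencoder relies on a quantization step that maps each encoder frame to its nearest entry in a learned codebook. The sketch below shows only that nearest-neighbor lookup (the codebook, encoder, and auxiliary supervision losses from the thesis are not reproduced here); the idea is that forcing the content branch through a small discrete vocabulary leaves little capacity for speaker detail:

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Sketch of the vector-quantization step in a VQ autoencoder:
    replace each encoder frame with its nearest codebook entry.

    frames:   (T, D) array of encoder outputs.
    codebook: (K, D) array of code vectors.
    Returns (quantized frames, code indices).
    """
    # Pairwise squared Euclidean distances between frames and codes.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)       # nearest code per frame
    return codebook[idx], idx
```

In a full factorized model, separate branches (content, speaker, residual) would each carry their own codebook or embedding, with auxiliary tasks encouraging each branch to keep only its intended information.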

Date and place

Monday, December 8 at 14:00
Maison des Langues et des Cultures, Salle Jacques Cartier

Jury members

Thesis Supervision:
Eric CASTELLI
Research Scientist (HDR), CNRS Delegation Alpes, Thesis Supervisor
Dominique VAUFREYDAZ
Full Professor, Université Grenoble Alpes, Thesis Co-supervisor
Sethserey SAM
Director General of Research, Académie de Technologie Digital du Cambodge, Thesis Co-advisor


Thesis Committee:
Eric CASTELLI
Research Scientist (HDR), CNRS Delegation Alpes, Thesis Supervisor
Nicolas AUDIBERT
Associate Professor, Université Sorbonne-Nouvelle, Reviewer
Jean-François BONASTRE
Research Director, Agence Ministérielle pour l'IA de Défense, Reviewer
Nathalie VALLÉE
CNRS Research Director, Gipsa-Lab, Université Grenoble Alpes, Examiner
Olivier CROUZET
Associate Professor, Nantes Université, Examiner
Dominique VAUFREYDAZ
Full Professor, Université Grenoble Alpes, Thesis Co-supervisor
Sethserey SAM
Director General of Research, Académie de Technologie Digital du Cambodge, Invited Member

Submitted on December 4, 2025

Updated on December 4, 2025