Unified Self-Supervised Encoders for French Text and Speech
Pantagruel is a family of self-supervised encoder models for French text and speech, trained within a unified
representation learning framework. This work is a collaboration between Université Grenoble Alpes (LIG), Institut National de l'Audiovisuel (INA), Avignon Université (LIA), Institut Polytechnique de Paris (CREST), Université Paris Cité (LLF), and Université Bretagne Sud (IRISA).
Pantagruel leverages feature-space predictive objectives (JEPA / data2vec 2.0) to train encoders for both modalities under the same learning framework.
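The core idea is easier to see in code. Below is a minimal, illustrative PyTorch sketch of a data2vec-style masked feature prediction loop, not the released training code: a student encoder predicts the contextual features produced by an exponential-moving-average (EMA) teacher at masked positions. The encoder, masking scheme, loss, and hyperparameters here are simplified stand-ins.

```python
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the transformer encoder; any sequence encoder fits here."""
    def __init__(self, dim=256, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        return self.net(x)

student = TinyEncoder()
teacher = copy.deepcopy(student)           # teacher weights are an EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(decay=0.999):
    # Teacher parameters track an exponential moving average of the student's.
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(decay).add_(ps, alpha=1.0 - decay)

def masked_feature_prediction_step(features, mask_ratio=0.5):
    """features: (batch, time, dim) frame or token embeddings fed to the encoder."""
    mask = torch.rand(features.shape[:2]) < mask_ratio   # positions the student must predict

    with torch.no_grad():
        targets = teacher(features)                      # contextual targets from the unmasked input

    corrupted = features.clone()
    corrupted[mask] = 0.0                                # crude masking of the student's input
    preds = student(corrupted)

    # Regress the teacher's features at masked positions only.
    return nn.functional.mse_loss(preds[mask], targets[mask])

loss = masked_feature_prediction_step(torch.randn(2, 50, 256))
loss.backward()
ema_update()
```

Because the targets live in feature space rather than in a modality-specific vocabulary, the same recipe applies to speech frames and text tokens alike.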
For speech, we trained models using the data2vec 2.0 masked feature prediction objective on diverse French audio data: Multilingual LibriSpeech (~1,000 hours), LeBenchmark (~14,000 hours), and INA-100k, a newly introduced 100,000-hour corpus of French broadcast speech contributed by INA.
For text, we combine data2vec-style feature-space prediction with masked language modeling (MLM) to capture both contextual and fine-grained linguistic information. Separate models were trained on the Wikipedia (4 GB), OSCAR (138 GB), and CroissantLLM (1.5 TB) corpora.
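For the text encoders, that combination amounts to a weighted sum of two terms at the masked positions. The sketch below is illustrative only; the head shapes, loss weights, and names are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

vocab_size, dim = 32000, 256
lm_head = nn.Linear(dim, vocab_size)    # predicts original token ids at masked positions

def hybrid_loss(student_states, teacher_states, target_ids, mask,
                mlm_weight=1.0, feat_weight=1.0):
    """
    student_states / teacher_states: (batch, seq, dim) encoder outputs
    target_ids: (batch, seq) original token ids
    mask: (batch, seq) boolean mask of positions corrupted in the student's input
    """
    # MLM term: cross-entropy on the original tokens at masked positions.
    logits = lm_head(student_states[mask])                       # (n_masked, vocab)
    mlm = nn.functional.cross_entropy(logits, target_ids[mask])

    # Feature term: regress the teacher's contextual features at the same positions.
    feat = nn.functional.mse_loss(student_states[mask], teacher_states[mask])

    return mlm_weight * mlm + feat_weight * feat

# Dummy shapes just to show the call:
B, T = 2, 16
loss = hybrid_loss(
    torch.randn(B, T, dim),
    torch.randn(B, T, dim),
    torch.randint(0, vocab_size, (B, T)),
    torch.rand(B, T) < 0.15,
)
```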
Key highlights:
- Unified training framework for French text & speech encoders
- JEPA / data2vec 2.0-style masked feature prediction
- Hybrid data2vec 2.0 + MLM for text encoders
- Competitive results compared to established French baselines
Models released on Hugging Face
📄 Paper: https://arxiv.org/abs/2601.05911
🤗 Models: https://huggingface.co/PantagrueLLM
- Speech models:
https://huggingface.co/collections/PantagrueLLM/speech-only-models
- Text models:
https://huggingface.co/collections/PantagrueLLM/text-only-models
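A minimal usage sketch with the Hugging Face `transformers` library is shown below. The repo id is a placeholder; the collection pages above list the actual model names and any model-specific loading steps (e.g. a feature extractor instead of a tokenizer for the speech models).

```python
from transformers import AutoTokenizer, AutoModel

model_id = "PantagrueLLM/<text-model-name>"   # placeholder, see the text-only collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Bonjour, comment allez-vous ?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_dim) contextual embeddings
```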
We hope Pantagruel serves as a useful resource for the research community working on French speech and text modeling.