
New paper release: Pantagruel

Unified Self-Supervised Encoders for French Text and Speech

Pantagruel

Pantagruel is a family of self-supervised encoder models for French text and speech, trained within a unified
representation learning framework. This work is the result of a collaboration between Université Grenoble Alpes (LIG), Institut National de l'Audiovisuel (INA), Avignon Université (LIA), Institut Polytechnique de Paris (CREST), Université Paris Cité (LLF), and Université Bretagne Sud (IRISA).

Pantagruel leverages feature-space predictive objectives (JEPA / data2vec 2.0) to train both modalities using the same learning framework.
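Below is a minimal sketch of what such a feature-space masked prediction step can look like in PyTorch. The module sizes, masking ratio, EMA rate, and single-layer predictor are illustrative assumptions, not the Pantagruel configuration (data2vec 2.0, for instance, averages several teacher layers as targets and only encodes unmasked positions in the student); see the paper for the actual setup.

```python
# Illustrative JEPA / data2vec-style masked feature prediction step (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the text or speech Transformer encoder."""
    def __init__(self, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        return self.encoder(x)

def ema_update(teacher, student, tau=0.999):
    """Teacher weights track the student via an exponential moving average."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1 - tau)

dim, batch, steps = 256, 2, 100
student, teacher = TinyEncoder(dim), TinyEncoder(dim)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(dim, dim)          # lightweight head predicting target features
mask_token = nn.Parameter(torch.zeros(dim))

x = torch.randn(batch, steps, dim)       # pre-extracted frame/token features (illustrative)

# 1) The teacher sees the unmasked input and produces the target representations.
with torch.no_grad():
    targets = teacher(x)

# 2) The student sees a masked view: a random subset of positions is replaced.
mask = torch.rand(batch, steps) < 0.5
x_masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
preds = predictor(student(x_masked))

# 3) Regress the teacher's features at the masked positions, then EMA-update the teacher.
loss = F.mse_loss(preds[mask], targets[mask])
loss.backward()
ema_update(teacher, student)
```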

For speech, we trained models using the data2vec 2.0 masked feature prediction objective on diverse French audio data: Multilingual LibriSpeech (~1,000 hours), LeBenchmark (~14,000 hours), and INA-100k, a newly introduced 100,000-hour corpus of French broadcast speech contributed by INA.

For text, we combine data2vec-style feature-space prediction with masked language modeling (MLM) to capture both contextual and fine-grained linguistic information. Separate models were trained on the Wikipedia (4 GB), OSCAR (138 GB), and CroissantLLM (1.5 TB) datasets.
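As a rough illustration of how the two text objectives can be combined, the sketch below adds a masked-position feature regression term to a standard MLM cross-entropy term. The loss weights, shapes, and the simple MSE choice are assumptions for illustration, not the values used for Pantagruel.

```python
# Illustrative hybrid text loss: data2vec-style feature regression + MLM (assumed weighting).
import torch
import torch.nn.functional as F

def hybrid_text_loss(student_states, teacher_targets, token_logits, token_labels,
                     mask, feat_weight=1.0, mlm_weight=1.0):
    """student_states / teacher_targets: (B, T, D); token_logits: (B, T, V);
    token_labels: (B, T) with -100 at unmasked positions; mask: (B, T) bool."""
    # Feature-space regression on masked positions (contextual information).
    feat_loss = F.mse_loss(student_states[mask], teacher_targets[mask])
    # Token-level MLM cross-entropy (fine-grained linguistic information).
    mlm_loss = F.cross_entropy(token_logits.flatten(0, 1), token_labels.flatten(),
                               ignore_index=-100)
    return feat_weight * feat_loss + mlm_weight * mlm_loss
```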

Key highlights:

-   Unified training framework for French text & speech encoders
-   JEPA / data2vec 2.0-style masked feature prediction
-   Hybrid data2vec 2.0 + MLM for text encoders
-   Competitive results compared to established French baselines


Models released on Hugging Face

📄 Paper: https://arxiv.org/abs/2601.05911

🤗 Models: https://huggingface.co/PantagrueLLM

-   Speech models:
   https://huggingface.co/collections/PantagrueLLM/speech-only-models
-   Text models:
   https://huggingface.co/collections/PantagrueLLM/text-only-models
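
A usage sketch, assuming the released checkpoints load through the standard Hugging Face transformers auto classes; the model id below is a placeholder to be replaced with an actual id from the collections above.

```python
# Placeholder model id; pick a real checkpoint from the PantagrueLLM collections.
from transformers import AutoModel, AutoTokenizer, AutoFeatureExtractor

model = AutoModel.from_pretrained("PantagrueLLM/<model-name>")
# For a text model, pair it with its tokenizer:
# tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/<model-name>")
# For a speech model, pair it with its feature extractor:
# feature_extractor = AutoFeatureExtractor.from_pretrained("PantagrueLLM/<model-name>")
```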

We hope Pantagruel serves as a useful resource for the research community working on French speech and text modeling.

Submitted on February 3, 2026

Updated on February 3, 2026