Anna Laskina | LIG - Université Grenoble Alpes

Wednesday, December 18th, 2024

Clustering Comparable Corpora: From Dataset Creation to Deep Learning Models and their Evaluation

Abstract:

Clustering is a key technique in text processing that aids in the organization and analysis of large text corpora. Comparable corpora, which consist of text collections from different sources that share thematic or domain similarities, benefit significantly from clustering, as it helps reveal shared themes and relationships between documents. This thesis is dedicated to the clustering of comparable corpora, with a focus on bilingual comparable corpora. In particular, it addresses the lack of ground-truth datasets for evaluating clustering algorithms tailored to comparable corpora. To overcome this challenge, we propose a novel methodology and an associated tool for extracting clustered bilingual comparable corpora from Wikipedia. This methodology provides precise control over various characteristics, including the cluster distribution across languages, the number of clusters per document, and the general domain of the collection. The tool is publicly available, making it accessible to researchers across diverse fields. We also explore the question: does accounting for different cluster distributions across languages improve the clustering of bilingual comparable corpora? To address it, we develop novel clustering models specifically tailored to comparable corpora, building upon the state-of-the-art Deep Kmeans clustering model. Lastly, we investigate external validation indices for soft clustering partitions and identify their limitations. In response, we propose an alternative approach to external validation that offers a more effective evaluation for texts covering multiple topics. The contributions of this thesis improve the reliability and effectiveness of clustering methods in the analysis of comparable corpora, providing deeper insights into multilingual and cross-domain text collections.

Date and place

Wednesday, December 18th, 2024 at 14:00
Maison Jean Kuntzmann

Jury members

Eric GAUSSIER
Professeur des Universités, Université Grenoble Alpes (Supervisor)

Gaëlle CALVARY
Professeure des Universités, Grenoble INP - UGA (Co-supervisor)

Mohamed NADIF
Professeur des Universités, Université Paris Cité (Reviewer)

Pierre ZWEIGENBAUM
Directeur de Recherche, CNRS, Délégation Île-de-France Sud (Reviewer)

François PORTET
Professeur des Universités, Université Grenoble Alpes (Examiner)

Liana ERMAKOVA
Maîtresse de Conférences, Université de Bretagne Occidentale (Examiner)