Wednesday, December 18th, 2024
Clustering Comparable Corpora: From Dataset Creation to Deep Learning Models and their Evaluation
Clustering is a key technique in text processing that aids in the organization and analysis of large text corpora. Comparable corpora, which consist of text collections from different sources that share thematic or domain similarities, benefit significantly from clustering as it helps reveal shared themes and relationships between documents. This thesis is dedicated to the clustering of comparable corpora, with a focus on bilingual comparable corpora. In particular, it addresses the issue of the lack of ground truth datasets for evaluating clustering algorithms tailored to comparable corpora. To overcome this challenge, we propose a novel methodology and an associated tool for extracting clustered bilingual comparable corpora from Wikipedia. This methodology provides precise control over various characteristics, including cluster distribution across languages, the number of clusters per document, and the general domain of the collection. Furthermore, the tool is publicly available, ensuring accessibility for researchers across diverse fields. In this thesis, we also explore the question: does accounting for different cluster distributions across languages improve the clustering of bilingual comparable corpora? To address this, we develop novel clustering models specifically tailored to comparable corpora, building upon the state-of-the-art Deep k-Means clustering model. Lastly, we investigate external validation indices for soft clustering partitions, identifying their limitations and challenges. In response, we propose an alternative approach to external validation that offers a more effective evaluation for texts covering multiple topics. The contributions of this thesis improve the reliability and effectiveness of clustering methods in the analysis of comparable corpora, providing deeper insights into multilingual and cross-domain text collections.
Date and place
Wednesday, December 18th, 2024 at 14:00
Maison Jean Kuntzmann
Jury members
Professeur des Universités, Université Grenoble Alpes (Supervisor)
Professeure des Universités, Grenoble INP - UGA (Co-supervisor)
Professeur des Universités, Université Paris Cité (Reviewer)
Directeur de Recherche, CNRS, Délégation Ile de France Sud (Reviewer)
Professeur des Universités, Université Grenoble Alpes (Examiner)
Maîtresse de Conférences, Université de Bretagne Occidentale (Examiner)