
Hugo LEBEAU

Tuesday, March 11, 2025

Random Matrix and Tensor Models for Large Data Processing

ABSTRACT
The exponential growth of computing power has enabled the widespread deployment of machine learning, which has in turn given rise to new challenges in data processing. The sheer volume of data now being generated means that the standard statistical assumption of a number of samples far greater than their dimension is no longer tenable. In the Big Data paradigm, datasets are typically of very large dimension and may also comprise several modes, reflecting a variety of sources, modalities, domains, and so on. Furthermore, the advancement of the technologies required to develop models capable of processing vast quantities of data comes at a significant environmental and human cost. In light of these concerns, it is imperative to promote a more judicious and prudent use of our resources.
Random matrix theory provides powerful tools to precisely study the statistical and computational limitations associated with the processing of large and multidimensional data. Through this lens, we examine several learning approaches to identify the relevant parameters influencing the success of a task and thereby facilitate an informed use of these methods.
First of all, we establish a "central limit theorem" on the behavior of the entries of spike eigenvectors of a Gram kernel matrix. This is an essential result to predict the performance of spectral clustering, which has so far been missing from the literature.
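For intuition, the sketch below (in NumPy) shows how class information concentrates in the top "spike" eigenvector of a Gram matrix and how its entries drive clustering accuracy; the entry-wise fluctuations of this eigenvector are precisely what such a central limit theorem quantifies. The two-class Gaussian mixture, the dimensions, and the signal strength are illustrative assumptions, not the exact kernel model studied in the thesis.

# Illustrative sketch: spectral clustering via the spike eigenvector of a
# Gram matrix. The two-class Gaussian mixture is an assumed toy model.
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1000                       # dimension and sample size of the same order
mu = rng.standard_normal(p)
mu *= 2.0 / np.linalg.norm(mu)         # class-mean direction, assumed strength 2
labels = rng.integers(0, 2, n)         # ground-truth classes
X = rng.standard_normal((p, n)) + np.outer(mu, 2 * labels - 1)

G = X.T @ X / p                        # Gram matrix (n x n)
_, eigvecs = np.linalg.eigh(G)
u = eigvecs[:, -1]                     # top ("spike") eigenvector

pred = (u > 0).astype(int)             # cluster by the sign of the entries
acc = max(np.mean(pred == labels), np.mean(pred != labels))  # sign-invariant
print(f"clustering accuracy: {acc:.3f}")

The misclassification rate of this sign rule is governed by how the entries of u fluctuate around their class-wise means, which is exactly the regime such a central limit theorem addresses.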
Then, we study an extension of spectral clustering to data streams. This approach makes it possible to cluster a potentially very large dataset with controlled and limited memory usage. Besides revealing the exotic spectral behavior of the associated matrix model, our results characterize the reconstruction performance of a noisy signal observed in a data stream. Moreover, we show that, with astute management of the available memory, it is possible to achieve performance comparable to that obtained without resource constraints. This amounts to a significant reduction in memory costs compared with standard spectral clustering, at a negligible loss of performance.
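As a rough picture of this memory-constrained setting, here is a hedged sketch of estimating the leading signal direction from a stream under a fixed memory budget, merging each incoming chunk into a small truncated SVD. This generic streaming low-rank sketch stands in for the thesis's actual streaming matrix model; the dimensions, chunk size, and sketch size are assumptions.

# Illustrative sketch: memory-limited streaming estimation of the leading
# direction; only k directions of dimension p are kept between chunks.
import numpy as np

rng = np.random.default_rng(1)
p, n, chunk, k = 400, 5000, 200, 4     # dimension, stream length, memory budget
mu = rng.standard_normal(p)
mu *= 2.0 / np.linalg.norm(mu)
labels = rng.integers(0, 2, n)
stream = rng.standard_normal((n, p)) + np.outer(2 * labels - 1, mu)

B = np.zeros((k, p))                   # fixed-size sketch (O(k * p) memory)
for start in range(0, n, chunk):
    batch = stream[start:start + chunk]
    M = np.vstack([B, batch])          # merge sketch with the new chunk
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    B = s[:k, None] * Vt[:k]           # keep only the top-k directions

u = B[0] / np.linalg.norm(B[0])        # estimated leading direction
pred = (stream @ u > 0).astype(int)    # second pass, for evaluation only
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(f"streaming clustering accuracy: {acc:.3f}")

Despite never holding more than k + chunk rows at a time, the sketch tracks the dominant direction; quantifying this kind of memory/performance trade-off is the object of the analysis.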
Finally, we turn our attention to the computational limits of tensor estimation, with a particular focus on low-rank approximation. Through the study of matrices obtained by unfolding a random tensor, we precisely describe the reconstruction performance of a noisy tensor signal by means of a truncated MLSVD (which generalizes the truncated SVD to tensors). In contrast to the matrix case, this estimate is only quasi-optimal, so we then investigate the computation of the best low-multilinear-rank tensor approximation with the HOOI algorithm. Using a similar approach, we examine the multi-view clustering problem from the perspective of a rank-one tensor approximation. Our results highlight and precisely quantify the pivotal role of view informativeness in the quality of the estimation. Furthermore, this study sheds light on a central phenomenon in tensor approximation: the statistical-to-computational gap, that is, the fundamental inability to algorithmically achieve the performance that is theoretically attainable by statistical estimation.
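To make these objects concrete, the sketch below computes a truncated MLSVD of a noisy rank-one tensor (the leading singular vector of each unfolding) and then refines it with HOOI iterations, which for multilinear rank (1, 1, 1) reduce to an alternating power method. The signal-plus-noise model, the strength beta, and the dimensions are illustrative assumptions, not the thesis's exact setting.

# Illustrative sketch: truncated MLSVD of a noisy rank-one tensor, then
# HOOI refinement toward the best rank-(1,1,1) approximation.
import numpy as np

rng = np.random.default_rng(2)
dims = (50, 60, 70)
x, y, z = (rng.standard_normal(d) for d in dims)
x, y, z = (v / np.linalg.norm(v) for v in (x, y, z))
beta = 10.0                            # assumed signal strength, above threshold
T = beta * np.einsum('i,j,k->ijk', x, y, z) \
    + rng.standard_normal(dims) / np.sqrt(sum(dims))

def unfold(T, mode):
    # mode-n unfolding: mode-n fibers become the columns
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def top_left_sv(M):
    # leading left singular vector
    return np.linalg.svd(M, full_matrices=False)[0][:, 0]

# truncated MLSVD: leading singular vector of each unfolding (quasi-optimal)
u, v, w = (top_left_sv(unfold(T, m)) for m in range(3))

# HOOI: alternately update each factor given the other two
for _ in range(10):
    u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
    v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
    w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)

print("alignments after HOOI:", abs(u @ x), abs(v @ y), abs(w @ z))

Above the algorithmic threshold, the MLSVD initialization already correlates with the signal and HOOI improves it; in the intermediate regime where the signal is statistically detectable yet such procedures fail, one faces the statistical-to-computational gap mentioned above.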

Date and place

Tuesday, March 11 at 13:30
GIPSA-lab, Room JM Chassery
and on Zoom

Jury members

Thesis supervised by Romain COUILLET and co-supervised by Florent CHATELAIN
Thesis prepared at LIG and GIPSA-lab

Philippe LOUBATON
Professor, Université Gustave Eiffel (Reviewer)
Rémi BARDENET
CNRS Research Director, Université de Lille (Reviewer)
Walid HACHEM
CNRS Research Director, Université Gustave Eiffel (Examiner)
Mylène MAÏDA
Professor, Université de Lille (Examiner)
Pierre COMON
CNRS Emeritus Research Director, Université Grenoble Alpes (Examiner)
Olivier MICHEL
Professor, Université Grenoble Alpes (President)
