Mardi 16 Juin 2026
- Imprimer
- Partager
- Partager sur Facebook
- Partager sur LinkedIn
Stochastic Optimization Algorithms in Machine Learning: Dynamics, Convergence, Generalization
Abstract
This thesis considers machine learning through the lens of mathematical optimization, and is concerned with several issues that can arise: non-convexity of the loss landscape, uncertainty and shifts in the data, and intricate structure of the parameter space.
In the first part of this thesis, we focus on stochastic gradient methods, and in particular stochastic gradient descent (SGD), which remains the workhorse for training today’s largest machine learning systems. Despite SGD’s long history and overwhelming success, one fundamental question about its behavior on general objectives remains open: what is the long-run behavior of SGD on non-convex landscapes? In particular, which regions of the parameter space are visited the most by SGD? By how much? And how long does it take for SGD to reach a set of parameters of interest, for instance the global minimum?
In this thesis, our goal is to give a general, principled answer for arbitrary non-convex objectives. Using large deviations and the theory of randomly perturbed dynamical systems, we show that the long-run distribution of SGD takes the form of a Boltzmann–Gibbs law: the step-size acts as a temperature, and the “energy levels” are determined jointly by the objective and the noise. As a consequence, SGD concentrates exponentially around the minimum-energy state — which need not be the global minimum — while other regions are visited with probabilities exponentially proportional to their energy. The same tools yield matching upper and lower bounds on the time SGD takes to reach a neighborhood of the global minimum, showing that this time is controlled by the most “costly” obstacles in the landscape, linking the problem’s global geometry with the statistics of the noise. Taken together, these works provide a sharp characterization of the long-run behavior of SGD on general non-convex objectives.
The second part of this thesis turns to robust and constrained optimization. We first study Wasserstein distributionally robust optimization (WDRO), a framework that hedges against distribution shifts by minimizing the worst-case loss over a Wasserstein ball around the empirical distribution. We propose an entropic regularization framework that makes the formulation tractable in standard machine learning processes, and we establish the first exact generalization bounds — upper bounds on the true risk without error terms — that avoid the curse of dimensionality in the choice of the radius of the ball.
We then move to constrained optimization and analyze the last-iterate convergence of Bregman proximal methods on constrained variational inequalities, introducing the Legendre exponent — a local regularity measure of the Bregman function near a solution. We show that this quantity determines the convergence rate in both deterministic and stochastic settings, yielding a strict linear/sublinear dichotomy depending on the choice of regularizer.
Date et lieu
Mardi 16 Juin à 15:00
Amphithéâtre Esclangon, bâtiment Ampère A
Composition du jury
Advisors:
Franck Iutzeler
Université de Toulouse
Jérôme Malick
CNRS, Université Grenoble Alpes
Panayotis Mertikopoulos
CNRS, Université Grenoble Alpes
Reviewers:
Francis Bach
Inria, École Normale Supérieure
John Duchi
Stanford University
Examiners:
Niao He
ETH Zürich
Anatoli Juditsky
Université Grenoble Alpes
Julien Mairal
Inria, Université Grenoble Alpes
Gabriel Peyré
CNRS, École Normale Supérieure
Université Grenoble Alpes — École Doctorale MSTII — Spécialité : Mathématiques Appliquées
Équipes DAO (LJK) et GHOST (LIG/Inria)
- Imprimer
- Partager
- Partager sur Facebook
- Partager sur LinkedIn