Thursday, October 3, 2024
Sensitivity Sampling for Coreset-Based Data Selection
Vincent Cohen-Addad, currently at Google, will present his work
Abstract:
The scale of modern machine learning models and data has made data selection a central problem. In this talk, we focus on the problem of finding the best representative subset of a dataset to train a machine learning model. We provide a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data, and that the model loss is Hölder continuous with respect to these embeddings, we prove that our approach selects a set of 1/ε² "typical" elements whose average loss matches the average loss of the whole dataset, up to a multiplicative (1±ε) factor and an additive ελΦ_k term, where Φ_k is the k-means cost of the input data and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is simpler and faster than previous ones such as leverage score sampling.
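To give a feel for the kind of procedure the abstract describes, here is a minimal sketch of clustering-based sensitivity sampling. It is not the speaker's algorithm: the abstract does not give the exact sampling scores, so the sketch assumes the classical k-means sensitivity bound (squared distance to the assigned center divided by Φ_k, plus one over the cluster size) as a stand-in. The function names `kmeans` and `sensitivity_sample` and all parameters are hypothetical, and the clustering step is a plain Lloyd iteration written in NumPy.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns centers, labels, squared distances."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels, d2[np.arange(len(X)), labels]


def sensitivity_sample(X, k, m, seed=0):
    """Draw m indices with probability proportional to an assumed
    k-means sensitivity score, and return importance weights."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    _, labels, d2 = kmeans(X, k, seed=seed)
    phi_k = d2.sum()  # k-means cost of the clustering found above
    cluster_sizes = np.bincount(labels, minlength=k)
    # Assumed score: d(x, center)^2 / Phi_k + 1 / |cluster(x)|.
    scores = 1.0 / cluster_sizes[labels]
    if phi_k > 0:
        scores = scores + d2 / phi_k
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    # Weights 1 / (m * p_i) make the weighted sum an unbiased estimate
    # of the corresponding sum over the full dataset.
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

With per-example losses `loss` on the full data, `(weights * loss[idx]).sum() / len(X)` would then serve as an estimate of the average loss; choosing `m` on the order of 1/ε² mirrors the sample size quoted in the abstract, though the guarantee stated there applies to the speaker's method, not to this sketch.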
Date and location
Thursday, October 3 at 11:00 a.m.
IMAG Building, room 306
Organized by
Eric GAUSSIER
Prof. Univ. Grenoble Alpes, Member of the Institut Universitaire de France (IUF), Director of the Grenoble Multidisciplinary Institute in Artificial Intelligence