Seminar Vincent Cohen-Addad

Thursday 3 October, 2024

Sensitivity Sampling for Coreset-Based Data Selection

Vincent Cohen-Addad, currently at Google, will present his work

Abstract

The scale of modern machine learning models and data has made data selection a central problem. In this talk, we focus on the problem of finding the best representative subset of a dataset to train a machine learning model. We provide a new data selection approach based on 𝑘-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data, and that the model loss is Hölder continuous with respect to these embeddings, we prove that our new approach selects a set of 1/𝜖² "typical" elements whose average loss matches the average loss of the whole dataset, up to a multiplicative (1±𝜖) factor and an additive 𝜖𝜆Φ_𝑘 term, where Φ_𝑘 is the 𝑘-means cost of the input data and 𝜆 is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is comparatively simpler and faster than previous ones such as leverage score sampling.
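
For readers unfamiliar with the idea, the following is a minimal sketch of generic sensitivity sampling on top of 𝑘-means. It is not the speaker's implementation. It assumes a textbook sensitivity score (a point's share of the 𝑘-means cost plus the inverse of its cluster size), and the scores used in the talk may differ, for instance by involving the model loss and the Hölder constant 𝜆. The function names, the toy data and the stand-in loss are all hypothetical. The sample size m plays the role of the 1/𝜖² elements mentioned above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_points(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with random initial centers (illustrative only)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = assign_points(points, centers)
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign_points(points, centers)

def sensitivity_sample(points, k, m, seed=0):
    """Draw m indices with probability proportional to an assumed
    sensitivity score, and return importance weights 1 / (m * p_i)."""
    centers, assign = kmeans(points, k, seed=seed)
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    phi = sum(costs)  # k-means cost of the clustering found above
    sizes = [assign.count(j) for j in range(k)]
    # Assumed score: share of the clustering cost + 1 / (size of own cluster).
    scores = [(c / phi if phi > 0 else 0.0) + 1.0 / sizes[a]
              for c, a in zip(costs, assign)]
    total = sum(scores)
    probs = [s / total for s in scores]
    rng = random.Random(seed)
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights, probs

# Toy data: two Gaussian blobs in the plane (hypothetical embeddings).
data_rng = random.Random(42)
data = ([(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)] +
        [(data_rng.gauss(8, 1), data_rng.gauss(8, 1)) for _ in range(200)])

def toy_loss(p):
    """Hypothetical stand-in for a model loss evaluated on one element."""
    return p[0] ** 2 + p[1] ** 2

idx, weights, probs = sensitivity_sample(data, k=2, m=50)
full_avg = sum(toy_loss(p) for p in data) / len(data)
est_avg = sum(w * toy_loss(data[i]) for i, w in zip(idx, weights)) / len(data)
print("average loss on all data:", full_avg)
print("weighted estimate from the sample:", est_avg)
```

Reweighting each sampled element by 1 / (m * p_i) makes the weighted sum an unbiased estimate of the sum of losses over the whole dataset. The guarantee stated in the abstract is a stronger, high-probability statement and is not demonstrated by this sketch.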

Date and place

Thursday 3 October at 11:00
IMAG building, room 306

Organised by

Eric GAUSSIER
Prof. Univ. Grenoble Alpes, Member of the Institut Universitaire de France (IUF), Director of the Grenoble Multidisciplinary Institute in Artificial Intelligence

Submitted on September 23, 2024

Updated on September 23, 2024