
Benjamin Roussillon

Wednesday, 15 September 2021

Learning in the presence of strategic data: models and solutions

Abstract (short):
We consider problems where an analyst aims to learn from strategically produced data. This arises in many applications, such as spam filtering, where the analyst wants to learn which emails are spam while spammers try to avoid being recognized as such (here the input data is the email), or predicting the quantity of products to allocate to franchised stores, where stores manipulate their data to obtain more of the popular products (here the input data is the store's sales history).
We focus on two ubiquitous problems: classification, where the analyst outputs a class corresponding to the data (matching the spam example), and linear regression, where the analyst outputs a quantity corresponding to the data (matching the quantity-prediction problem). We show that, for both of these problems, classical results obtained in non-strategic settings no longer hold, highlighting the necessity of taking strategic considerations into account in learning.


Abstract (long):

In this thesis, we consider the problem of learning when data are strategically produced. This challenges the widely used assumption in machine learning that test data are independent of training data, an assumption shown to fail in many applications where the outcome of the learning problem has strategic interest to some agents. We study the two ubiquitous problems of classification and linear regression, and focus on how fundamental learning properties of these problems change relative to the classical setting where data are not strategically produced.

We first consider the problem of finding optimal classifiers in an adversarial setting where the class-1 data is generated by an attacker whose objective is not known to the defender---an aspect that is key to realistic applications but has so far been overlooked in the literature.

To model this situation, we propose a Bayesian game framework where the defender chooses a classifier with no a priori restriction on the set of possible classifiers. The key difficulty in the proposed framework is that the set of possible classifiers is exponential in the set of possible data, which is itself exponential in the number of features used for classification. To counter this, we first show that Bayesian Nash equilibria can be characterized completely via functional threshold classifiers with a small number of parameters. We then show that this low-dimensional characterization enables us to develop a training method that computes provably approximately optimal classifiers in a scalable manner, as well as an online learning algorithm with low regret (both independent of the dimension of the set of possible data).
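The idea of a classifier parameterized by a single threshold, rather than by a decision over every possible input, can be illustrated with a minimal sketch. This is not the thesis's construction: the score below is a simple log-likelihood ratio between two assumed Bernoulli feature distributions (p0, p1), and the threshold t is an arbitrary illustrative choice.

```python
import numpy as np

# Assumed, illustrative distributions over binary features:
# p0[j] = frequency of feature j under class 0, p1[j] under class 1.
n_features = 8
p0 = np.full(n_features, 0.2)
p1 = np.full(n_features, 0.7)

def score(x):
    """Log-likelihood ratio of a binary feature vector x under p1 vs p0."""
    return np.sum(x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0)))

def threshold_classifier(x, t=0.0):
    """Classify x as class 1 (attack) iff its score exceeds the threshold t."""
    return int(score(x) > t)

# The classifier is determined by t alone, even though there are
# 2**n_features possible inputs -- the kind of low-dimensional family
# the abstract refers to.
x_attack = np.ones(n_features, dtype=int)
x_benign = np.zeros(n_features, dtype=int)
print(threshold_classifier(x_attack), threshold_classifier(x_benign))  # 1 0
```

Tuning a single scalar t (or a small parameter vector, in the functional case) is what makes scalable training and low-regret online learning tractable despite the exponential input space.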

We then consider the problem of linear regression from strategic data sources. In the classical setting where the precision of each data point is fixed, the famous Aitken/Gauss-Markov theorem in statistics states that generalized least squares (GLS) is a so-called "Best Linear Unbiased Estimator" (BLUE) and is consistent (the model is perfectly learned as the amount of data grows). In modern data science, however, one often faces strategic data sources, namely individuals who incur a cost for providing high-precision data. We model this as learning from strategic data sources with a public-good component, i.e., data are provided by strategic agents who seek to minimize an individual provision cost for increasing their data's precision while benefiting from the model's overall precision. Our model tackles the case where there is uncertainty on the attributes characterizing the agents' data, a critical aspect of the problem when the number of agents is large.

We show that, in general, Aitken's theorem does not hold under strategic data sources, though it does hold if individuals have identical provision costs (up to a multiplicative factor). When individuals have non-identical costs, we derive a bound on the improvement of the equilibrium estimation cost that can be achieved by deviating from GLS, under mild assumptions on the provision cost functions and on the possible deviations from GLS. We also provide a characterization of the game's equilibrium, which reveals an interesting connection with optimal design.

Subsequently, we focus on the asymptotic behavior of the covariance of the linear regression parameters estimated via generalized least squares as the number of data sources becomes large. We provide upper and lower bounds for this covariance matrix and show that, when the agents' provision costs are superlinear, the model's covariance converges to zero, but at a slower rate than in virtually all learning problems with exogenous data. If the agents' provision costs are linear, this covariance fails to converge at all. Thus even the basic consistency property of generalized least squares estimators is compromised when the data sources are strategic.
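The classical (non-strategic) baseline discussed above can be sketched concretely. The snippet below is a standard textbook illustration, not the thesis's model: with heteroscedastic noise of known per-point variances, GLS weights each observation by the inverse of its noise variance, and its covariance (X^T W X)^{-1} shrinks as data accumulate, which is the consistency property that strategic provision can break. All data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                # covariates of the n data sources
beta = np.array([1.0, -2.0, 0.5])          # true model parameters
variances = rng.uniform(0.5, 4.0, size=n)  # fixed (non-strategic) noise variances
y = X @ beta + rng.normal(scale=np.sqrt(variances))

# GLS estimator: beta_hat = (X^T W X)^{-1} X^T W y, with W = diag(1/variances).
W = np.diag(1.0 / variances)
beta_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Covariance of the GLS estimator: (X^T W X)^{-1}. In the classical setting
# its entries vanish as n grows; with strategic, cost-sensitive sources this
# convergence slows down (superlinear costs) or fails entirely (linear costs).
cov_gls = np.linalg.inv(X.T @ W @ X)
print(beta_gls)
print(np.trace(cov_gls))
```

Here the variances are exogenous; the thesis's point is precisely that when each agent chooses its own precision against a provision cost, this baseline behavior no longer holds.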

Date and Location

Wednesday, 15 September at 2:00 pm
Seminar Room 1 and

Jury Members

Researcher, Cyber Defence Campus, Switzerland, Reviewer
University Professor, LIA, Université d'Avignon, France, Reviewer
Marie-Christine ROUSSET
University Professor, LIG, Université Grenoble-Alpes, France, Examiner
Johanne COHEN
Research Director, LRI, CNRS, France, Examiner
University Professor, LAMSADE, Université Paris Dauphine-PSL, France, Examiner
Research Scientist, LIG, CNRS, France, Thesis co-advisor
Research Scientist, LIG, CNRS, France, Thesis advisor

Published 9 September 2021

Updated 9 September 2021