Martin Kirchgessner - Mining and ranking closed itemsets from large-scale transactional datasets

Organized by: 
Martin Kirchgessner
Martin Kirchgessner
Detailed information: 

Membres du jury :

  • Mme. Laure Berti-Equille, Senior Scientist au Qatar Computing Research Institute, rapporteur
  • M. Florent Masseglia, chargé de Recherche INRIA, rapporteur
  • Mme Céline Robardet, professeur à l’INSA de Lyon, examinatrice
  • M. Noël De Palma, professeur à l’Université Grenoble Alpes, examinateur
  • M. Renaud Lachaize, maître de conférences à l’Université Grenoble Alpes, examinateur
  • M. Alexandre Termier, professeur à l’Université de Rennes 1, examinateur
  • Mme Sihem Amer-Yahia, directrice de recherche CNRS, directeur de thèse
  • M. Vincent Leroy, maître de conférences à l’Université Grenoble Alpes, co-encadrant de thèse


Réalisation technique : Antoine Orlandi | Tous droits réservés

The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of items sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (the most frequents, usually). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and less than a hundred associations at once.
To address the items' coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI computes efficiently interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed-up computation on large scale datasets.
Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain.
Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets.
Altogether our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items' associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.