Wednesday, September 29, 2021
Multi-class Classification and Feature Selection with Partially Labeled Data


Learning with partially labeled data, known as semi-supervised learning, deals with problems where only a few training examples are labeled while unlabeled data are abundant and valuable for training. In this thesis, we study this framework in the multi-class classification case, with a focus on self-learning and feature selection. Self-learning is a classical approach that iteratively assigns pseudo-labels to unlabeled training examples whose confidence score is above a predetermined threshold. This pseudo-labeling technique is prone to error and runs the risk of adding noisy labels to the training data. Our first contribution is a theoretical framework for analyzing self-learning in the multi-class case. We derive a transductive bound on the risk of the multi-class majority vote classifier and propose to use this bound to choose the pseudo-labeling threshold automatically. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier on pseudo-labeled data. We derive a probabilistic C-bound on the majority vote error given imperfect labels. Our second contribution is an extension of the self-learning strategy to the case where some unlabeled examples come from classes not previously seen. The new approach, which assumes the existence of clusters in the unlabeled data, is applied to the classification of real biological data.
Finally, we propose an approach for semi-supervised feature selection that uses self-learning to increase the variety of training data and a new variant of the genetic algorithm to perform the feature subset search. The proposed genetic algorithm produces a solution that is both sparse and accurate by taking feature weights into account during its evolutionary process and iteratively removing irrelevant features.
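
As an illustration of the classical self-learning loop described in the abstract, the following minimal Python sketch pseudo-labels unlabeled points whose confidence exceeds a fixed threshold and retrains after each round. It is not the method of the thesis: the nearest-centroid base classifier, the softmax confidence score, and the names `centroids`, `predict_proba`, `self_learning`, `threshold`, and `max_iter` are all assumptions made for the example, and the threshold is fixed by hand rather than chosen from a transductive bound.

```python
import math


def centroids(X, y):
    """Mean vector per class (hypothetical stand-in for the base classifier)."""
    sums, counts = {}, {}
    for x, c in zip(X, y):
        s = sums.setdefault(c, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict_proba(cents, x):
    """Assumed confidence score: softmax over negative squared distances."""
    scores = {c: -sum((a - b) ** 2 for a, b in zip(x, m))
              for c, m in cents.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}


def self_learning(X_lab, y_lab, X_unl, threshold=0.9, max_iter=10):
    """Iteratively move confidently predicted points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        cents = centroids(X_lab, y_lab)
        confident, rest = [], []
        for x in pool:
            proba = predict_proba(cents, x)
            label = max(proba, key=proba.get)
            if proba[label] > threshold:
                confident.append((x, label))
            else:
                rest.append(x)
        if not confident:  # nothing passes the threshold: stop
            break
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)  # pseudo-label, possibly noisy
        pool = rest
    # Final model trained on labeled + pseudo-labeled data, plus leftovers
    return centroids(X_lab, y_lab), pool
```

Points that never pass the threshold remain in the returned pool. A lower threshold pseudo-labels more points but admits more label noise, which is the trade-off that motivates choosing the threshold automatically.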

Updated on September 21, 2021