Wednesday 27 November, 2024
Object Discovery in Images, Videos, and 3D Scenes
Abstract:
Object discovery is the task of detecting and segmenting semantically coherent regions of images. Object discovery in images is fundamentally harder than the classic computer vision tasks of object detection or segmentation, because regions may correspond to previously unseen object categories, and the appearance of unseen objects varies with viewpoint, scale, and lighting conditions.
Robust discovery and segmentation of previously unseen objects in images requires extremely general features that can accommodate variations in object appearance, occlusion, and background clutter. Our research began with an investigation of the possibility of using the latent variables of self-supervised Vision Transformers as features for unsupervised object discovery in images. This led to a simple yet effective algorithm, TokenCut, described in Chapter 3 of this thesis. TokenCut has been shown to be effective for unsupervised object discovery, unsupervised saliency detection, and weakly supervised object localization on a variety of datasets.
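For readers unfamiliar with the approach, the sketch below illustrates the kind of graph-cut step referred to here: patch tokens from a self-supervised Vision Transformer are compared by cosine similarity and the resulting graph is bipartitioned with a Normalized Cut. The similarity threshold and the foreground-selection heuristic are illustrative assumptions, not the exact choices made in the thesis.

```python
# Minimal sketch of a TokenCut-style bipartition over ViT patch tokens.
import numpy as np
from scipy.linalg import eigh

def tokencut_bipartition(features, tau=0.2):
    """features: (N, D) array of patch tokens from a self-supervised ViT
    (e.g. DINO). Returns a boolean foreground mask over the N patches."""
    # Cosine similarity between every pair of patch tokens.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # Binarize the graph: keep strong edges, give weak ones a tiny weight.
    w = np.where(sim > tau, 1.0, 1e-5)
    # Normalized Cut relaxation: second smallest generalized eigenvector
    # of (D - W) x = lambda D x.
    d = np.diag(w.sum(axis=1))
    _, vecs = eigh(d - w, d)
    second = vecs[:, 1]
    # Bipartition by the eigenvector values; as a simple heuristic,
    # take the smaller partition as the foreground object.
    mask = second > second.mean()
    if mask.sum() > (~mask).sum():
        mask = ~mask
    return mask
```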
Following our success with unsupervised object discovery in images, we extended TokenCut to unsupervised object detection in video using motion and appearance. The enhanced TokenCut algorithm integrates RGB appearance and optical flow features across video frames, creating a comprehensive graph that allows moving objects to be detected and segmented. This extension, described in Chapter 4, demonstrates a unified approach to discovering objects in both static and dynamic scenes, highlighting the robustness and effectiveness of the TokenCut algorithm.
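As a rough illustration of the video extension, the sketch below fuses appearance and optical-flow similarities into a single affinity matrix spanning all frames of a clip; the fusion weight and threshold are assumptions made for the example only. The resulting matrix can then be partitioned with the same Normalized Cut step as in the single-image sketch above.

```python
# Sketch of a spatio-temporal affinity combining appearance and motion.
import numpy as np

def spatio_temporal_affinity(appearance, flow, alpha=0.5, tau=0.2):
    """appearance: (T*N, D_a) ViT patch features stacked over T frames.
    flow: (T*N, D_f) optical-flow features for the same patches.
    Returns a binarized affinity matrix over all patches in the clip."""
    def cosine(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    # Weighted fusion of appearance and motion similarity.
    sim = alpha * cosine(appearance) + (1.0 - alpha) * cosine(flow)
    return np.where(sim > tau, 1.0, 1e-5)
```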
Encouraged by the success of our work on object discovery in videos, we turned our attention to the problem of consistent segmentation of 3D objects in 3D scenes using natural language queries. In Chapter 5, we describe a novel approach that integrates 3D Gaussian splatting with pretrained multimodal language models. This method automates the generation and annotation of 3D masks, enabling object segmentation from textual queries, and demonstrates its effectiveness on a number of relevant datasets.
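As a loose illustration of text-driven selection in such a pipeline, the sketch below picks out Gaussians whose language-aligned features match a query embedding. The per-Gaussian feature tensor, the embedding model, and the threshold are all assumptions for the example, not the thesis's actual pipeline.

```python
# Sketch of selecting 3D Gaussians from a natural-language query,
# assuming each Gaussian carries a language-aligned feature vector
# (e.g. distilled from a CLIP-style multimodal model).
import numpy as np

def select_gaussians(gaussian_features, text_embedding, threshold=0.25):
    """gaussian_features: (G, D) language-aligned features, one per Gaussian.
    text_embedding: (D,) embedding of the textual query.
    Returns indices of Gaussians whose features match the query."""
    g = gaussian_features / np.linalg.norm(gaussian_features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = g @ t
    return np.nonzero(scores > threshold)[0]
```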
These results provide new solutions for object discovery in images, video, and 3D scenes. They also illustrate the power of large foundation models for addressing long-standing challenges in Computer Vision and Artificial Intelligence.
Date and place
Wednesday 27 November at 9:30 am
Amphitheater of the Maison Jean Kuntzmann (1st floor)
Jury members
Bernt Schiele
Professor, Max Planck Institute for Informatics, Saarland University, Reviewer
Vincent Lepetit
Professor, École des Ponts ParisTech, Reviewer
Julien Mairal
Research Director, Inria, Examiner
Diane Larlus
Principal Scientist, Naver Lab, Examiner
Shell Xu Hu
Senior Scientist, Samsung AI Center Cambridge, Examiner
James Crowley
Institut Polytechnique de Grenoble, Thesis Supervisor
Dominique Vaufreydaz
Université Grenoble Alpes, Co-Supervisor