Minghan Li | LIG - Université Grenoble Alpes

Adapting Deep Neural Information Retrieval Models to Long Documents and New Domains (Adapter des modèles de recherche d'information basés sur les réseaux neuronaux profonds pour les documents longs et les nouveaux domaines)

Friday, July 7, 2023

Abstract:

In the era of big data, information retrieval (IR) plays a pivotal role in our daily lives. Deep neural networks, specifically Transformer-based models, have brought remarkable improvements to neural IR. However, their effectiveness is still constrained by several limitations. This thesis aims to advance neural IR by addressing three key topics: long document retrieval for Transformer-based models, domain adaptation for dense retrieval and conversational search, and a novel differentiable approximation of listwise loss functions.

The first topic addresses the challenge of retrieving relevant information from long documents. The self-attention mechanism has quadratic complexity in the input length, which makes it difficult for Transformer-based models to process long documents. This thesis proposes a framework that pre-ranks the passages of a long document with respect to the query, and then combines or processes the top-ranking passages that pass this filter to obtain the document relevance score. Experiments on IR collections with both interaction-based and late-interaction-based models demonstrate state-of-the-art level effectiveness.
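
As a rough illustration of the pre-rank-then-score idea described above, the following minimal Python sketch splits a document into passages, keeps the k passages a cheap scorer ranks highest for the query, and aggregates their scores into a document score. Everything in it is an assumption made for illustration: the function names, the fixed-size word windows, the toy term-overlap scorer standing in for both the pre-ranker and the neural ranker, and the max aggregation. The abstract does not specify these choices.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def split_passages(doc: str, size: int = 4) -> List[str]:
    """Split a document into consecutive windows of `size` words (hypothetical choice)."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [""]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of query terms that occur in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def score_document(query: str, doc: str, k: int = 2, size: int = 4,
                   pre_ranker: Scorer = overlap_score,
                   ranker: Scorer = overlap_score) -> float:
    """Pre-rank passages, keep the top k (k >= 1), and aggregate their scores with max."""
    passages = split_passages(doc, size)
    top = sorted(passages, key=lambda p: pre_ranker(query, p), reverse=True)[:k]
    # Only the k selected passages reach the (expensive) ranker.
    return max(ranker(query, p) for p in top)


doc = "cats sleep all day neural retrieval ranks documents dogs bark at night"
print(score_document("neural retrieval", doc, k=1))
```

In a real system the ranker would be a Transformer-based model, and the point of the filter is that it only ever sees k short passages rather than the full document.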

The second topic explores domain adaptation for dense retrieval and conversational search. The generalization ability of dense retrieval models on target domains is limited. This thesis proposes a self-supervision approach that generates pseudo-relevance labels for queries and documents on the target domain, using an interaction-based model, T5-3B, applied to a list of candidates retrieved by BM25. Different negative mining strategies are investigated to improve the proposed approach. Conversational search is challenging because the system needs to understand ambiguous user intent in each query turn, and obtaining labels for target datasets is difficult. Existing approaches for training conversational dense retrieval models can be further improved to tackle the domain shift issue. This thesis uses a T5-Large model to generate rewritten queries for target datasets and applies an approach similar to the one used for dense retrieval to generate pseudo-relevance data. Experimental results show that the pseudo-relevance labeling approach improves dense retrieval and conversational dense retrieval models on the target domain when they are fine-tuned on the generated data.
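
The pseudo-relevance labeling step can be pictured with a short Python sketch: a teacher scorer re-ranks a first-stage candidate list, the top document becomes the pseudo-positive, and negatives are mined from lower ranks. This is only a sketch under stated assumptions. The `toy_teacher` function is a term-overlap stand-in for the interaction-based model, and the `skip` and `n_neg` parameters describe one hypothetical negative mining strategy, not necessarily one investigated in the thesis.

```python
from typing import Callable, List, Tuple


def toy_teacher(query: str, doc: str) -> float:
    """Stand-in for a neural re-ranker: fraction of query terms found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def pseudo_label(query: str, candidates: List[str],
                 teacher: Callable[[str, str], float] = toy_teacher,
                 n_neg: int = 2, skip: int = 1) -> Tuple[str, List[str]]:
    """Return (pseudo-positive, mined negatives) from a first-stage candidate list.

    `skip` leaves a gap after the positive so that near-duplicates of the
    positive are less likely to be used as negatives (an assumed heuristic).
    """
    ranked = sorted(candidates, key=lambda d: teacher(query, d), reverse=True)
    positive = ranked[0]
    negatives = ranked[1 + skip:1 + skip + n_neg]
    return positive, negatives


candidates = [
    "domain adaptation for dense retrieval",
    "dense retrieval models",
    "domain shift",
    "cooking recipes",
    "adaptation of dense retrieval",
]
positive, negatives = pseudo_label("dense retrieval domain adaptation", candidates)
print(positive, negatives)
```

The resulting (query, positive, negatives) triples are the kind of generated data on which a dense retriever could then be fine-tuned for the target domain.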

The third topic focuses on the use of listwise loss functions for learning to rank in IR. Popular IR metrics are not differentiable, which limits the potential for training better IR models. This thesis proposes a softmax-based approximation of the rank indicator function, a key component in the design of IR metrics. Experiments on learning to rank and text-based IR tasks demonstrate the good quality of the proposed approximations of IR metrics.
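
To make the idea of a softmax-based rank indicator concrete, here is one possible construction in plain Python. The abstract does not give the thesis's formulation, so the recursion below is an assumption: a softmax over scaled scores approximates the one-hot indicator of the rank-1 item, and each later rank reuses the softmax after multiplying every score by a mask that pushes already-selected items down. The hyperparameters `alpha` and `delta`, the function names, and the restriction to positive scores are all illustrative choices.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_rank_indicators(scores: List[float], max_rank: int,
                           alpha: float = 20.0, delta: float = 0.1) -> List[List[float]]:
    """Row r-1 softly indicates which item sits at rank r (assumes positive scores)."""
    mask = [1.0] * len(scores)
    rows = []
    for _ in range(max_rank):
        row = softmax([alpha * s * m for s, m in zip(scores, mask)])
        rows.append(row)
        # Items that received most of the mass get a negative mask, so their
        # masked score falls below that of every item not yet selected.
        mask = [m * (1.0 - p - delta) for m, p in zip(mask, row)]
    return rows


def smooth_precision_at_k(scores: List[float], relevance: List[float], k: int,
                          alpha: float = 20.0, delta: float = 0.1) -> float:
    """Smooth surrogate of Precision@k built from the smooth rank indicators."""
    rows = smooth_rank_indicators(scores, k, alpha, delta)
    return sum(p * rel for row in rows for p, rel in zip(row, relevance)) / k


scores = [3.0, 1.0, 2.0]
for row in smooth_rank_indicators(scores, 3):
    print([round(p, 3) for p in row])
```

Because every operation is a composition of exponentials, sums, and products, the same computation written with an automatic differentiation library would yield gradients with respect to the scores, which is what makes such a surrogate usable as a listwise training loss.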

Overall, this thesis contributes novel approaches to address important challenges in IR. The proposed approaches demonstrate improvements and provide valuable insights into the development of effective IR systems.

Date and Venue

Friday, July 7, 2023, at 2:00 p.m.
IMAG Building, Room 306

Jury Members

ERIC GAUSSIER
Full Professor, UNIVERSITE GRENOBLE ALPES (Thesis supervisor)

BENJAMIN PIWOWARSKI
Research Scientist, CNRS DELEGATION PARIS CENTRE (Reviewer)

LYNDA TAMINE LECHANI
Full Professor, UNIVERSITE TOULOUSE 3 - PAUL SABATIER (Reviewer)

DIDIER SCHWAB
Full Professor, UNIVERSITE GRENOBLE ALPES (Examiner)

JAAP KAMPS
Associate Professor, Universiteit van Amsterdam (Examiner)

SOPHIE ROSSET
Research Director, CNRS DELEGATION ILE-DE-FRANCE SUD (Examiner)