Friday, 15 October 2021
Human-guided exploration of data collections
Abstract:


Data exploration aims to guide the understanding of data collections and to define the kinds of questions that can be asked on top of them, often through interactive exploration processes. It deals with raw digital data collections and must cope with uncertainty in both data content and analysis: query results are not necessarily correct and complete (i.e., they do not necessarily consist of all the data tuples that satisfy the requirements expressed by a question). Data exploration engines will be next-generation systems promoting a new querying philosophy, in which queries gradually converge towards ones that exploit raw data collections and meet the expectations of data explorers (i.e., users).
This thesis proposes HILDEX, a human-in-the-loop data exploration system that enables users to explore textual data collections by gradually refining queries and their associated results. Textual data collections are pre-processed using machine learning and artificial intelligence text-processing algorithms.
HILDEX implements the exploration algorithms proposed in this work (query morphing, query-by-example, queries-as-answers), which refine an initial query by taking into account the content of the collections to be explored, so that the data can be explored more effectively. HILDEX thus proposes a workflow for exploring texts by analysing data samples obtained by queries that can be refined through human-in-the-loop tasks. Partial exploration results are assessed through metrics (precision, similarity) and through information that explains why some documents are contained in these results. By examining the documents in partial results together with the explanations and metrics, the user can decide to continue interacting with HILDEX to rewrite queries until she is satisfied with both the queries and the results. The algorithms and HILDEX have been evaluated on data related to crises in urban computing and on the exploration of information on COVID-19.
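As an illustration only, the refine-and-assess loop described above might be sketched as follows. This is not the HILDEX implementation. The function names (search, refine, explore, judge), the use of Jaccard overlap as the similarity metric, and the term-expansion heuristic are all assumptions made for this example; the thesis's own algorithms (query morphing, query-by-example, queries-as-answers) are not reproduced here.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w]

def jaccard(a, b):
    """Similarity between two term collections (assumed metric)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def search(query_terms, docs, k=3):
    """Return indices of the k documents most similar to the query."""
    scored = [(jaccard(query_terms, tokenize(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]

def precision(result_ids, relevant_ids):
    """Fraction of returned documents that the user judged relevant."""
    if not result_ids:
        return 0.0
    return sum(1 for i in result_ids if i in relevant_ids) / len(result_ids)

def refine(query_terms, docs, liked_ids, n_new=2):
    """Hypothetical refinement: add the terms most shared by liked documents."""
    counts = Counter(
        t for i in liked_ids for t in set(tokenize(docs[i])) if t not in query_terms
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [t for t, _ in ranked[:n_new]]

def explore(query_terms, docs, judge, target=0.8, max_rounds=5):
    """Human-in-the-loop loop: `judge` stands in for the user's feedback."""
    history = []
    for _ in range(max_rounds):
        results = search(query_terms, docs)
        liked = {i for i in results if judge(docs[i])}
        p = precision(results, liked)
        history.append((list(query_terms), results, p))
        if p >= target or not liked:
            break  # user is satisfied, or there is nothing to learn from
        query_terms = refine(query_terms, docs, liked)
    return history
```

In an interactive setting, judge would be replaced by the user's actual relevance decisions, and each entry of the returned history (query, results, precision) corresponds to one partial exploration result shown to the user.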
Updated on 7 October 2021