Monday 23 August 2021
Recommendation of data placement for processing in smart-grid data lakes
This thesis deals with optimising access to the masses of data generated and/or exploited in the management of smart grids. In practice, these data (raw measurements, refined data, historical data, etc.) are represented in very varied data models (relational, key-value, document, graph, etc.) and stored in highly heterogeneous Big Data systems. These systems differ in the functionalities they offer (e.g., some cannot perform a join), their data structures (for storage and indexing), their algorithms, and their performance.
This thesis aims to optimise the execution of processing workflows over these datasets by recommending the placement of data on the most appropriate systems, based on metadata describing the datasets, the workflows, and the storage and processing systems, so as to minimise the total execution time. The total execution time comprises the time needed to transform and move data and the time needed to execute queries rewritten according to these transformations. Indeed, we also explore moving data from one system to another when the target system offers characteristics that favour the execution of the workflow's queries.
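The cost decomposition above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the function names, the `placement` structure, and the constant costs are all assumptions introduced here; in practice the two estimators would come from the statistics-based simulation and learning described later.

```python
# Sketch of the objective described above (hypothetical names):
# total execution time = (time to transform/move data to the chosen systems)
#                      + (time to execute the rewritten workflow queries).

def total_execution_time(placement, workflow, move_cost, query_cost):
    """placement maps each dataset to a target system; move_cost and
    query_cost are assumed, externally supplied time estimators."""
    transfer = sum(move_cost(ds, placement[ds]) for ds in placement)
    execution = sum(query_cost(q, placement) for q in workflow)
    return transfer + execution

# Toy illustration with made-up constant costs:
placement = {"raw_measurements": "keyvalue_store", "refined_data": "relational_db"}
workflow = ["q1", "q2"]
move = lambda ds, system: 2.0    # illustrative constant move/transform time
query = lambda q, pl: 3.0        # illustrative constant query execution time
print(total_execution_time(placement, workflow, move, query))  # 2*2.0 + 2*3.0 = 10.0
```

The recommendation problem is then to search over candidate placements for one minimising this quantity.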
The study of the techniques used in data management systems and in data integration/mediation systems has convinced us that it is impossible to define a universal cost model for query plan execution that would allow the performance of different systems to be compared. A promising alternative is to use machine learning techniques for this estimation.
We therefore propose an approach named DWS, for Data, Workloads and Systems. DWS explores different combinations of systems for executing a workflow, eliminating solutions in which a system cannot execute all the operators of a query (feasibility condition) or which violate the business rules governing where initial, intermediate or final data may be stored (compliance condition). The execution time of the different queries (data transformation queries or queries extracted from the workflow) is estimated by injecting statistics into the systems: the goal is to simulate execution and thereby retrieve the optimal plans and, where available, their cost estimates. The execution time is then estimated by learning from these outputs together with useful metadata on the datasets, the workloads and the systems.
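The pruning step described above, keeping only system combinations that satisfy both conditions, can be sketched as follows. All names, datasets, operator sets and business rules below are illustrative assumptions, not DWS internals; the sketch only shows how the two conditions cut down the search space before any cost estimation.

```python
from itertools import product

def candidate_placements(datasets, systems, operators_needed, can_execute, allowed):
    """Enumerate assignments of datasets to systems, keeping only those where
    every operator applied to a dataset is supported by its system
    (feasibility) and every dataset may legally reside on its system
    (compliance). All parameter names are illustrative."""
    for combo in product(systems, repeat=len(datasets)):
        placement = dict(zip(datasets, combo))
        # feasibility: each system must support the operators run on its data
        if not all(ops <= can_execute[placement[ds]]
                   for ds, ops in operators_needed.items()):
            continue
        # compliance: business rules on where each dataset may be stored
        if not all(placement[ds] in allowed[ds] for ds in datasets):
            continue
        yield placement

# Toy example: a key-value store that cannot join, and a relational DB that can.
datasets = ["measurements", "history"]
systems = ["kv", "rdb"]
can_execute = {"kv": {"scan"}, "rdb": {"scan", "join"}}
operators_needed = {"measurements": {"scan"}, "history": {"join"}}
allowed = {"measurements": {"kv", "rdb"}, "history": {"rdb"}}

survivors = list(candidate_placements(datasets, systems,
                                      operators_needed, can_execute, allowed))
print(len(survivors))  # 2 of the 4 combinations survive both conditions
```

Only the surviving placements are passed on to the cost-estimation phase, which ranks them by estimated total execution time.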
Updated on 23 August 2021