Joseph Emeras - Workload Traces Analysis and Replay in Large Scale Distributed Platforms

Organized by:

Joseph Emeras

Speaker:

Joseph Emeras

Teams:
Detailed information:

- Defense venue: Grand Amphi INRIA

- Jury members:

  • Mr. Dror FEITELSON, The Hebrew University, Jerusalem, Reviewer
  • Ms. Christine MORIN, INRIA, Examiner
  • Mr. Denis TRYSTRAM, UJF, Examiner
  • Mr. Yves DENNEULIN, Grenoble INP, Thesis advisor
  • Mr. Olivier RICHARD, UJF, Thesis co-advisor
  • Mr. Christian PEREZ, INRIA, Reviewer
  • Mr. Philippe DENIEL, CEA DAM
Abstract:

High Performance Computing is preparing for the transition from Petascale to Exascale. Distributed computing systems already face new scalability problems due to the increasing number of computing resources to manage. It is now necessary to study these systems in depth and to understand their behaviors, strengths and weaknesses in order to better build the next generation. The complexity of managing users' applications on these resources has led to the analysis of the workload the platform has to support, in order to provide users with an efficient service.

The need for workload comprehension has led to the collection of traces from production systems and to the proposal of a standard workload format. These contributions enabled the study of many of these traces and led to the construction of several models based on the statistical analysis of the different workloads in the collection. Until recently, existing workload traces did not enable researchers to study how jobs consume resources over time. This is now changing with the need to characterize job consumption patterns. In the first part of this thesis we propose a study of existing workload traces. We then contribute an observation of cluster workloads that takes into account the jobs' resource consumption over time. This highlights specific and unexpected patterns in users' resource usage. Finally, we propose an extension of the existing standard workload format that makes it possible to add such temporal consumption data without losing the benefit of existing work.
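As an illustration of what such a trace contains, here is a minimal Python sketch that reads job records. It assumes the standard format mentioned above is the Standard Workload Format (SWF) of the Parallel Workloads Archive, in which comment lines start with `;`, unknown values are written as `-1`, and the first five whitespace-separated fields are job number, submit time, wait time, run time and number of allocated processors. The names `Job`, `parse_swf` and `total_core_seconds` are hypothetical and are not tools from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Job:
    job_id: int
    submit_time: int  # seconds since the start of the trace
    wait_time: int    # seconds spent waiting in the queue
    run_time: int     # seconds of execution (-1 if unknown)
    procs: int        # number of allocated processors (-1 if unknown)


def parse_swf(lines: Iterable[str]) -> List[Job]:
    """Parse SWF-style lines, keeping only the first five fields (assumed layout)."""
    jobs = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and header comments, which start with ';'.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        # Go through float() so that values such as "100.0" are accepted too.
        values = [int(float(f)) for f in fields[:5]]
        jobs.append(Job(*values))
    return jobs


def total_core_seconds(jobs: Iterable[Job]) -> int:
    """Sum run_time * procs over jobs where both values are known."""
    return sum(j.run_time * j.procs
               for j in jobs
               if j.run_time >= 0 and j.procs >= 0)
```

Note that a record of this kind carries a single aggregate per job, which is why it cannot describe how consumption varies during the job's execution.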

Experimental approaches based on workload models have also served the goal of evaluating distributed systems. Existing models describe the average behavior of observed systems. However, although the study of average behaviors is essential for understanding distributed systems, the study of critical cases and particular scenarios is also necessary. Such a study would give a more complete view and understanding of the performance of resource and job management. In the second part of this thesis we propose an experimental method for the performance evaluation of distributed systems based on the replay of extracts from production workload traces. These extracts, placed back in their original context, make it possible to experiment with changes to the system configuration under an online workload and to observe the results of the different configurations. Our technical contribution to this experimental approach is twofold. We propose a first tool that constructs the environment in which the experiment will take place, and a second set of tools that automate the experiment setup and replay the trace extract within its original context.
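One way to picture a trace extract together with its original context is sketched below in Python. This is an illustrative assumption, not the tooling developed in the thesis: the `Job` tuple, the `split_extract` function and the choice to define the context as the jobs already running or queued at the start of the window are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Job(NamedTuple):
    job_id: int
    submit_time: int  # seconds since the start of the trace
    wait_time: int    # seconds spent waiting in the queue
    run_time: int     # seconds of execution


def split_extract(jobs: List[Job], t0: int, t1: int
                  ) -> Tuple[List[Job], List[Job], List[Job]]:
    """Split a trace around the window [t0, t1).

    Returns (running, queued, extract):
      running: jobs submitted before t0 that are executing at t0
      queued:  jobs submitted before t0 that are still waiting at t0
      extract: jobs submitted inside the window, to be replayed
    """
    running, queued, extract = [], [], []
    for job in jobs:
        start = job.submit_time + job.wait_time
        end = start + job.run_time
        if t0 <= job.submit_time < t1:
            extract.append(job)
        elif job.submit_time < t0:
            if start <= t0 < end:
                running.append(job)
            elif start > t0:
                queued.append(job)
            # Jobs that finished before t0 are not part of the context.
    return running, queued, extract
```

Under these assumptions, a replay would first restore the `running` and `queued` jobs so that the system starts in a state close to the one observed in production, and only then submit the `extract` jobs at their recorded times.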

Finally, these contributions, taken together, make it possible to gain better knowledge of HPC platforms. As future work, the approach proposed in this thesis will serve as a basis for further study of larger infrastructures.