Monday, May 23, 2022
On Algebraic Foundations for the optimization of Iterative Programming with Distributed Data Collections
Abstract :
The goal of my PhD is to study the optimization and the distribution of queries, especially recursive queries, handling large amounts of data. I propose extensions to formal approaches along two main lines of work: (1) algebras based on the relational model, for which I propose Dist-µ-RA, and (2) algebras based on generic collections of arbitrary types, for which I propose µ-monoids. Dist-µ-RA is a system that extends the  -RA algebra to the distributed setting. Regarding the algebraic aspect, it integrates well with the relational algebra and inherits its advantages including the fact that queries are optimized regardless of their initial shape and translation into the algebra. With respect to distribution, different strategies for evaluating recursive algebraic terms in a distributed setting have been studied. These  strategies are implemented as plans with automated techniques for distributing data in order to reduce communication costs. Experimental results on both real and synthetic graphs show the effectiveness of the proposed approach compared to existing systems. µ-monoids is an extension of the monoid algera with a fixpoint operator that models recursion. The extended  µ-monoids algebra is suitable for modeling recursive computations with distributed data collections such as the ones found in Big Data frameworks. The major interest of the “µ” fixpoint operator is that, under prerequisites that are often met in practice, it can be considered as a monoid homomorphism and thus can be evaluated by parallel loops with one final merge rather than by a global loop requiring network overhead after each iteration. Rewriting rules for optimizing fixpoint terms, such as pushing filters, are proposed. In particular, I propose a sufficient condition on the repeatedly evaluated term regardless of its shape, as well as a method using polymorphic types and a type system such as Scala’s to check whether this condition holds. I also propose a rule to prefilter a fixpoint before a join. The third rule allows for pushing aggregation functions inside a fixpoint. experiments with the Spark platform illustrate performance gains brought by these systematic optimizations.
Mis à jour le 20 May 2022