Nicolas Tollenaere

Decoupling the optimization space of tensor computations for a better understanding of CPU performance

Wednesday, December 14, 2022

Abstract:

This work focuses on the problem of optimizing a class of programs we call tensor computations, which includes matrix multiplication, tensor contraction (a generalization of matrix multiplication), and convolution.
Our approach is based on insights introduced in GotoBLAS \cite{GotoBLAS} and BLIS \cite{BLIS1}.
One of the critical points of our methodology is the use of what we call a \emph{microkernel}, a small optimized block of code that serves as a building block for the whole program.
We present a pipeline called TTiLE that automatically generates specialized code for a given problem size.
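
To make the microkernel idea concrete, here is a minimal Python sketch of a microkernel-based matrix multiplication. It is illustrative only: the names, sizes, and layout below are hypothetical, and TTiLE emits specialized low-level code rather than Python.

    # Sketch of a microkernel-based matrix multiplication
    # (hypothetical structure, not TTiLE's actual output).
    # Computes C += A @ B.

    MR, NR = 4, 8  # microkernel footprint, sized to fit vector registers

    def microkernel(A, B, C, i, j, K):
        """Fixed-size MR x NR block of C: the building block of the nest."""
        for k in range(K):
            for ii in range(MR):
                for jj in range(NR):
                    C[i + ii][j + jj] += A[i + ii][k] * B[k][j + jj]

    def matmul(A, B, C, M, N, K):
        # Outer levels: tile the iteration space into microkernel-sized
        # blocks (assumes M % MR == 0 and N % NR == 0 for brevity).
        for i in range(0, M, MR):
            for j in range(0, N, NR):
                microkernel(A, B, C, i, j, K)

    # Toy usage: every entry of C ends up equal to K.
    M, N, K = 8, 16, 4
    A = [[1.0] * K for _ in range(M)]
    B = [[1.0] * N for _ in range(K)]
    C = [[0.0] * N for _ in range(M)]
    matmul(A, B, C, M, N, K)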

We separate the optimization scheme into two phases: first, we select the microkernel through an empirical search; then, we select the outer levels.
In this second phase, we leverage existing work on performance modeling.
This methodology allows us to match the performance of state-of-the-art tools on recent architectures.
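
Schematically, the two phases can be sketched as follows; the candidate sets, benchmark, and cost model are hypothetical placeholders, not the actual interfaces used in the thesis.

    import random

    def pick_microkernel(candidates, benchmark):
        """Phase 1: empirical search, keep the fastest measured candidate."""
        return max(candidates, key=benchmark)

    def pick_outer_levels(kernel, schedules, model):
        """Phase 2: rank outer loop structures with a performance model."""
        return min(schedules, key=lambda s: model(kernel, s))

    # Toy usage: microkernels as (MR, NR) register footprints, outer
    # schedules as cache-level tile sizes; random values stand in for
    # real measurements and a real model.
    kernels = [(4, 8), (6, 8), (8, 4)]
    schedules = [(64, 64, 256), (128, 128, 128), (32, 256, 512)]
    best_kernel = pick_microkernel(kernels, benchmark=lambda k: random.random())
    best_schedule = pick_outer_levels(best_kernel, schedules,
                                      model=lambda k, s: random.random())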

To explore our design space and evaluate our search strategy, we crafted a dedicated code generation and evaluation platform.

Over the course of this exploration, we had to reconsider how a search process is evaluated.
We introduce a clear cut between what we call the \emph{search space} and the \emph{search heuristic}.
We characterize a search space by its distribution, which is evaluated empirically by random sampling.
This distribution is then used as a baseline: a search heuristic is worth using only if, for the same number of tests, it converges toward a better candidate than a random search.
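
In code, this evaluation protocol could look like the following sketch, where the space, the evaluation function, and the budget are hypothetical stand-ins for the real platform.

    import random

    def sample_distribution(space, evaluate, n):
        """Empirical performance distribution of the space, by random sampling."""
        return [evaluate(random.choice(space)) for _ in range(n)]

    def random_search_baseline(space, evaluate, budget):
        """Best performance reached by pure random search within a budget."""
        return max(evaluate(random.choice(space)) for _ in range(budget))

    # Toy usage: configurations are integers whose value is their
    # "performance"; evaluate stands in for compiling and benchmarking.
    space = list(range(1000))
    evaluate = lambda cfg: cfg
    distribution = sample_distribution(space, evaluate, n=100)
    baseline = random_search_baseline(space, evaluate, budget=100)
    # A heuristic with the same budget is worth using only if it
    # finds a candidate better than `baseline`.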

This evaluation process allows us to characterize which choices improve the search by raising the density of good candidates in the space, and which ones do not.
Some of the choices we made early in our research, such as generating only perfectly nested or nearly perfectly nested loops, proved to have little to no positive impact on final performance.

Our results show that the combination of a very straightforward code generation scheme, a restriction to a set of selected microkernels at the innermost level, and a random search over the outer levels of the loop nest converges very quickly toward candidates that are competitive with the state of the art (mainly AutoTVM/Ansor, but also Halide and oneDNN) on Intel CPUs, in both parallel and sequential settings.

The main takeaway of this thesis is the coupling between space definition and exploration.
That is, the performance of a given space-exploration heuristic (whether guided by a deterministic model or based on a learning algorithm) strongly depends on how the space it explores is configured.
A loosely restricted space, in which a huge proportion of candidates are inadequate, imposes the need for a sophisticated search.
Conversely, tightening the space of possible implementations by constraining some design choices can make even naive search strategies, such as random selection, competitive with more elaborate ones.

Date and Location

Wednesday, December 14, 2022, at 3:00 p.m.
Inria branch at Minatec

Published on December 9, 2022

Updated on December 9, 2022