
Talk by Valentin Reis

Thursday, January 30, 2025

LLM Inference on Groq

Abstract:

The race to model size continues: at the time of writing, monolithic 400B+ parameter Large Language Models (LLMs) have been publicly released. The larger these training and inference workloads become, the more demand they place on GPU memory capacity, which has led to accelerators with increasingly complex memory hierarchies. Groq takes the opposite approach to this scaling challenge.

We’ll explain how we co-designed a compilation-based software stack and a class of accelerators called Language Processing Units (LPUs). The core architectural decision underlying LPUs is determinism: systems of clock-synchronized accelerators are programmed via detailed instruction scheduling. This includes networking, which is also scheduled statically. As a result, LPU-based systems face fewer networking constraints, which makes their SRAM-only design practical without a memory hierarchy. This redesign yields high utilization and low end-to-end system latency. Moreover, determinism enables static reasoning about program performance – the key to kernel-free compilation.
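To give a flavor of what “static reasoning about program performance” means, here is a minimal illustrative sketch (not Groq’s actual stack; all names and latencies are made up): when every operation’s latency is known exactly at compile time, a compiler can assign each operation a precise issue cycle and predict end-to-end latency without ever running the program.

```python
# Hypothetical sketch: cycle-accurate static scheduling on deterministic
# hardware. Latencies are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    latency: int                               # cycles, known at compile time
    deps: list = field(default_factory=list)   # names of producer ops

def schedule(ops):
    """Assign each op the earliest cycle at which all its inputs are ready."""
    done_at = {}                               # op name -> result-ready cycle
    for op in ops:                             # ops listed in topological order
        start = max((done_at[d] for d in op.deps), default=0)
        done_at[op.name] = start + op.latency
        print(f"{op.name}: issue at cycle {start}, done at {done_at[op.name]}")
    return max(done_at.values())

# A toy matmul -> bias -> activation chain.
program = [
    Op("matmul", latency=40),
    Op("bias_add", latency=4, deps=["matmul"]),
    Op("gelu", latency=8, deps=["bias_add"]),
]
total = schedule(program)
print(f"statically known end-to-end latency: {total} cycles")
```

Because nothing in such a schedule depends on runtime behavior, the compiler’s predicted latency is the actual latency, which is what makes kernel-free compilation tractable.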

We’ll talk about the challenges of partitioning models across networks of accelerators, and outline how our HW/SW system architecture continues to enable breakthrough LLM inference latency at all model sizes.
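As a rough intuition for what partitioning a model across accelerators involves, here is a minimal sketch (hypothetical, not Groq’s API): one layer’s weight matrix is split column-wise across several devices, each device computes its shard of the output, and the reassembly step stands in for the communication that, on deterministic hardware, can itself be scheduled statically.

```python
# Hypothetical sketch of column-wise weight sharding across devices.
import numpy as np

def shard_columns(weights, num_devices):
    """Split a (d_in, d_out) weight matrix into per-device column shards."""
    return np.array_split(weights, num_devices, axis=1)

def parallel_matmul(x, shards):
    """Each 'device' computes x @ shard; the concatenation stands in for a
    statically scheduled gather over the accelerator network."""
    partial_outputs = [x @ w for w in shards]
    return np.concatenate(partial_outputs, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))
weights = rng.standard_normal((512, 2048))

shards = shard_columns(weights, num_devices=4)
y = parallel_matmul(x, shards)
assert np.allclose(y, x @ weights)   # sharded result matches the monolith
```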

Speaker:
Valentin Reis is a software engineer at Groq, Inc., where he works on the software stack that powers LPU accelerators. He was previously a postdoctoral appointee in the Mathematics and Computer Science division at Argonne National Laboratory, and holds a PhD from Univ. Grenoble Alpes, prepared in the DataMove team.

Date and venue

Thursday, January 30, 2025, 15:00 – 17:00
Bâtiment IMAG, room 406

Organized by

Bruno RAFFIN
Head of the DataMove team

Published January 7, 2025

Updated January 16, 2025