Thursday, 30 January 2025
LLM Inference on Groq
The race to ever-larger models continues. At the time of writing, monolithic 400B+ parameter Large Language Models (LLMs) have been publicly released. The larger these training and inference workloads become, the more demand they place on GPU memory capacity, which has led to accelerators that incorporate increasingly complex technology into their memory hierarchy. Groq addresses this scaling challenge from the opposite direction.
We’ll explain how we co-designed a compilation-based software stack and a class of accelerators called Language Processing Units (LPUs). The core architectural decision underlying LPUs is determinism: systems of clock-synchronized accelerators are programmed via detailed instruction scheduling. This includes networking, which is also scheduled statically. As a result, LPU-based systems face fewer networking constraints, which makes their SRAM-based design practicable without a memory hierarchy. This redesign results in high utilization and low end-to-end system latency. Moreover, determinism enables static reasoning about program performance, the key to kernel-free compilation.
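To give an intuition for why determinism enables static reasoning about performance, the following Python sketch schedules a tiny program of compute and communication operations whose latencies are fixed and known at compile time. It is purely illustrative: the op names, devices, and cycle counts are hypothetical, and this is not Groq's compiler or instruction set, just a toy showing that with deterministic latencies every instruction gets a concrete start cycle and the end-to-end latency is known before anything runs.

```python
# Toy static scheduler: deterministic, fixed-latency ops can be assigned exact
# start cycles at compile time, so end-to-end latency is known statically.
# All names and latency values below are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    device: int
    latency: int          # cycles, assumed known at compile time
    deps: tuple = ()      # names of ops that must finish first


def schedule(ops):
    """Greedily assign a concrete start cycle to every op (ops in topological order)."""
    start, end = {}, {}
    busy_until = {}                                   # per-device availability cursor
    for op in ops:
        ready = max((end[d] for d in op.deps), default=0)
        s = max(ready, busy_until.get(op.device, 0))
        start[op.name], end[op.name] = s, s + op.latency
        busy_until[op.device] = s + op.latency
    return start, end


program = [
    Op("matmul_shard0", device=0, latency=120),
    Op("matmul_shard1", device=1, latency=120),
    Op("allreduce",     device=0, latency=40, deps=("matmul_shard0", "matmul_shard1")),
    Op("activation",    device=0, latency=25, deps=("allreduce",)),
]

start, end = schedule(program)
for op in program:
    print(f"{op.name:>14}: start cycle {start[op.name]:4d}, end cycle {end[op.name]:4d}")
print("End-to-end latency known at compile time:", max(end.values()), "cycles")
```

Because nothing in such a schedule depends on runtime arbitration, the same static reasoning can extend to network transfers between accelerators, which is the property the talk builds on.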
We’ll talk about the challenges of breaking models apart over networks of accelerators, and outline how our HW/SW system architecture continues to enable breakthrough LLM inference latency at all model sizes.
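As background on what "breaking a model apart" means, here is a minimal NumPy sketch of generic column-wise tensor parallelism: a weight matrix is split across devices, each device computes its shard locally, and the shards are gathered over the interconnect. The shapes and device count are made up, and this is not Groq's partitioning scheme, only the standard idea the talk refers to.

```python
# Generic tensor-parallel sketch (hypothetical shapes, not Groq's scheme):
# split a weight matrix column-wise across devices, compute local shards,
# then gather the partial results over the interconnect.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))          # one token's activation
W = rng.standard_normal((4096, 8192))       # full weight matrix

num_devices = 4
shards = np.split(W, num_devices, axis=1)   # each device holds one column shard

partials = [x @ w for w in shards]          # local matmuls on each device
y = np.concatenate(partials, axis=1)        # gather shards over the network

assert np.allclose(y, x @ W)                # sharded result matches the monolithic one
```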
Date and location
Thursday, 30 January 2025, 15:00 – 17:00
IMAG building, room 406
Organized by
Bruno RAFFIN
Head of the DataMove team