Project 8: Teaching Machines at Scale: Distributed Training for LLMs

Project Summary

This project combines core ideas from machine learning, scientific computing, and distributed systems. Students will work with a small transformer-based language model and a prepared text dataset, progressing from basic single-device training to multi-process distributed execution. Along the way, they will learn how text is tokenized for training, how gradients are synchronized across devices, and how PyTorch and job schedulers such as PBS and Slurm are used to run experiments at scale. The project will also introduce practical topics such as checkpointing, mixed-precision training, throughput analysis, and reproducible experimental design.
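To make the idea of gradient synchronization concrete, here is a minimal sketch in plain Python. It is a toy simulation and not the project's actual training code. It assumes a one-parameter model y = w * x with a mean-squared-error loss, two simulated workers, and equal-sized data shards. The helper names (local_gradient, all_reduce_mean, split_evenly) are hypothetical and chosen only for illustration. In a real PyTorch run, this averaging step is typically handled by a collective all-reduce operation rather than written by hand.

```python
# Toy simulation of data-parallel gradient synchronization.
# Assumptions: one scalar weight w, model y = w * x, mean-squared-error loss,
# and "workers" that are just Python lists (no real devices or processes).


def local_gradient(w, shard):
    """Gradient of the mean squared error with respect to w on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def split_evenly(data, num_workers):
    """Give each simulated worker an equal-sized contiguous shard of the data."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


# Illustrative dataset of (x, y) pairs and an arbitrary starting weight.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

shards = split_evenly(data, num_workers=2)

# Each worker computes a gradient on its own shard only.
local = [local_gradient(w, shard) for shard in shards]

# After synchronization, every worker holds the same averaged gradient.
synced = all_reduce_mean(local)

# Reference: the gradient a single device would compute on the full batch.
full = local_gradient(w, data)

print("local gradients: ", local)
print("synced gradients:", synced)
print("full-batch gradient:", full)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why every worker can apply the same update and keep its copy of the model identical to the others. If the shards had different sizes, a plain mean of the local gradients would no longer equal the full-batch gradient.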

By the end of the bootcamp, participants will have built and run an end-to-end distributed training workflow for a language model, gained hands-on experience with AI on HPC systems, and developed an understanding of the challenges and opportunities involved in scaling modern machine learning workloads. The project is designed as a practical introduction to distributed model training, a computational workflow that is central to contemporary AI research.

Learning Objectives

Understand foundational concepts of neural network architectures and training
Describe the architecture and core components of transformers
Compare different distributed training strategies
Apply LLMs to real-world scientific problems