Project Summary
This project combines core ideas from machine learning, scientific computing, and distributed systems. Students will work with a small transformer-based language model and a prepared text dataset, progressing from basic single-device training to multi-process distributed execution. Along the way, they will learn how text is tokenized for training, how gradients are synchronized across devices, and how PyTorch and job schedulers such as PBS or Slurm are used to run experiments at scale. The project will also introduce practical topics such as checkpointing, mixed-precision training, throughput analysis, and reproducible experimental design.
By the end of the bootcamp, participants will have built and run an end-to-end distributed training workflow for a language model, gained hands-on experience with AI on HPC systems, and developed an understanding of the challenges and opportunities involved in scaling modern machine learning workloads. This project is designed to give students a practical introduction to one of the most important computational workflows in contemporary AI research.
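As a rough illustration of the gradient synchronization mentioned in the summary, the sketch below simulates data-parallel training in plain Python: each worker computes a gradient on its own shard of data, the gradients are averaged across workers, and every replica applies the same update. The function names (`all_reduce_mean`, `sgd_step`) and the numbers are hypothetical and chosen only for illustration. In practice PyTorch handles this step through `torch.distributed` and `DistributedDataParallel`, so this is a conceptual sketch and not the project's actual code.

```python
from typing import List


def all_reduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers (simulated all-reduce)."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one plain SGD update and return the new parameters."""
    return [p - lr * g for p, g in zip(params, grad)]


# Hypothetical gradients computed by two workers on different data shards.
worker_grads = [
    [0.5, -1.0],  # worker 0
    [1.5, 3.0],   # worker 1
]

# Every worker ends up with the same averaged gradient.
avg_grad = all_reduce_mean(worker_grads)

# All replicas start from identical parameters and apply the identical update,
# so the model copies stay in sync after the step.
params = [1.0, 2.0]
replicas = [sgd_step(params, avg_grad, lr=0.5) for _ in worker_grads]

print(avg_grad)
print(replicas)
```

Because every replica sees the same averaged gradient, the model copies remain identical after each step, which is the property that data-parallel training relies on.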
Learning Objectives
- Understand foundational concepts of neural network architectures and training
- Describe the architecture and core components of transformers
- Compare different distributed training strategies
- Apply LLMs to real-world scientific problems
