Project 7: Leveraging Large Language Models for Regulatory Genomics on HPC

This project explores how large language models can be used to identify and classify the function of DNA sequences.

Project Description

Gene expression is regulated by specific DNA sequences whose patterns are distributed and context-dependent. This project applies large language models tailored for genomics to learn sequence representations and to perform downstream classification and generation. The approach uses HPC resources to pretrain on human DNA subsequences, fine-tune for classification, and generate synthetic sequences with safety guardrails, using transformer-based methods adapted for DNA tokenization.
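
To make "DNA tokenization" concrete, the sketch below shows one common way to turn a DNA string into integer token IDs: overlapping k-mers. The project description does not say which tokenization scheme is used, so the choice of k-mers, the value k=3, the special tokens, and the function names (build_kmer_vocab, kmer_tokenize, encode) are all illustrative assumptions, not the project's actual method.

```python
from itertools import product


def build_kmer_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer ID.

    The special tokens and their ordering are assumptions for illustration.
    """
    specials = ["[PAD]", "[UNK]"]
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {token: idx for idx, token in enumerate(specials + kmers)}


def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA string into k-mers (overlapping when stride < k)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k=3, stride=1):
    """Convert a DNA string to token IDs; unknown k-mers (e.g. containing N) map to [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in kmer_tokenize(seq, k, stride)]


vocab = build_kmer_vocab(k=3)
tokens = kmer_tokenize("ACGTNA", k=3)
ids = encode("ACGTNA", vocab, k=3)
print(tokens)
print(ids)
```

With stride=1 the k-mers overlap, so neighboring tokens share context; setting stride=k gives non-overlapping tokens and shorter inputs. Any k-mer containing an ambiguous base such as N falls back to the [UNK] ID in this sketch.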

Learning Objectives

  1. Represent and process DNA sequences for AI models.
  2. Run and monitor AI training on HPC systems.
  3. Apply and evaluate models for genomic tasks.
  4. Explore safe generation of synthetic DNA sequences.
  5. Explain and apply key HPC/AI concepts (including parallel computing, job scheduling, and model training) by accurately describing their purpose and implementing them in a computational workflow.
  6. Collaborate effectively in a team-based research environment by contributing to shared code repositories, participating in peer reviews, and communicating findings clearly during project meetings.