Tutorial 5 — Training Your Own Network on the WSU Grid
AI Imaging Workshop · the 11:00 AM "Training your own neural
network using the WSU grid" block (and the 9:30 SLURM /
reproducible-computing themes). Prerequisites: Tutorials 1
& 3 (the imaging env, nnU-Net v2 installed).
Training needs a GPU for hours, so it runs on the WSU grid through the SLURM scheduler — you write a small script describing the job, submit it, and SLURM runs it when a GPU is free. This tutorial covers SLURM, a real nnU-Net training run, and the practices that make results reproducible.
1. SLURM in five commands
SLURM shares the cluster's GPUs/CPUs across many users. You never run heavy work directly on the login node — you submit a batch job:
| Command | What it does |
|---|---|
| sbatch job.sh | submit a job script to the queue |
| squeue -u $USER | see your jobs (PD = pending, R = running) |
| scancel <jobid> | cancel a job |
| sacct -j <jobid> | history/exit status of a finished job |
| sinfo -s | what partitions/nodes exist |
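The submit-then-check loop from the table can also be scripted. The sketch below is one possible way to do it, assuming sbatch is on the PATH of the login node. It relies on sbatch's --parsable flag, which prints only the job id (optionally followed by ;cluster). The helper names parse_job_id and submit are hypothetical and not part of SLURM.

```python
import subprocess


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the job id from `sbatch --parsable` output.

    With --parsable, sbatch prints "<jobid>" or "<jobid>;<cluster>".
    """
    first_field = sbatch_stdout.strip().split(";")[0]
    return int(first_field)


def submit(script: str) -> int:
    """Submit a job script and return its job id (needs a SLURM cluster)."""
    result = subprocess.run(
        ["sbatch", "--parsable", script],
        check=True,           # raise if sbatch rejects the script
        capture_output=True,  # collect stdout so we can read the job id
        text=True,
    )
    return parse_job_id(result.stdout)
```

On the grid, job_id = submit("train.sh") returns a number you can pass straight to squeue -j, scancel, or sacct -j, and write into your run notes.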
2. Anatomy of a job script
A SLURM script is a normal bash script with #SBATCH
directives at the top that request resources. Create
train.sh:
#!/usr/bin/env bash
#SBATCH --job-name=nnunet_train
#SBATCH --partition=gpu # the GPU partition (ask your admin for the name)
#SBATCH --gres=gpu:1 # request 1 GPU
#SBATCH --cpus-per-task=8 # CPU cores for data loading
#SBATCH --mem=64G # RAM
#SBATCH --time=24:00:00 # max wall-clock (job is killed past this)
#SBATCH --output=logs/%x_%j.out # %x=job-name, %j=job-id
set -euo pipefail # stop on the first error (good practice)
# --- make the environment reproducible ---
source ~/miniforge3/etc/profile.d/conda.sh
source ~/miniforge3/etc/profile.d/mamba.sh
mamba activate imaging
export nnUNet_raw=~/workshop/nnunet/raw
export nnUNet_preprocessed=~/workshop/nnunet/preprocessed
export nnUNet_results=~/workshop/nnunet/results
# --- the actual work ---
nnUNetv2_train 1 3d_fullres 0 # dataset 1, config 3d_fullres, fold 0
Submit and watch it:
mkdir -p logs
sbatch train.sh
squeue -u $USER
tail -f logs/nnunet_train_*.out # live log
Key directives to understand:
--gres=gpu:1 (without it the job is allocated no GPU),
--time (too short and the job is killed mid-run; too long and it is
harder to schedule), and --output (where stdout and stderr go; always
read this file first when something fails).
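When a job dies, sacct reports why, which tells you which of these directives to adjust. The sketch below is a rough illustration rather than a complete tool: it parses pipe-separated sacct output (the --parsable2 and --noheader flags) and maps a few common states to a hint. The helper names and the hint texts are hypothetical; the state names TIMEOUT, OUT_OF_MEMORY, FAILED, and COMPLETED are standard SLURM job states.

```python
import subprocess


def job_states(sacct_output: str) -> dict:
    """Map each job step to its state from pipe-separated sacct output.

    Expects lines of the form "JobID|State|ExitCode".
    """
    states = {}
    for line in sacct_output.strip().splitlines():
        step, state, _exit_code = line.split("|")
        # sacct may append detail, e.g. "CANCELLED by 1001"; keep the first word
        states[step] = state.split()[0] if state.strip() else ""
    return states


def explain(state: str) -> str:
    """Return a short hint for a SLURM job state (illustrative wording)."""
    hints = {
        "TIMEOUT": "hit the --time limit; request more wall-clock time",
        "OUT_OF_MEMORY": "exceeded --mem; request more RAM",
        "FAILED": "non-zero exit status; read the --output log",
        "COMPLETED": "finished normally",
    }
    return hints.get(state, "read the --output log for details")


def query(job_id: int) -> dict:
    """Ask sacct for the states of one job (needs a SLURM cluster)."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
         "--format=JobID,State,ExitCode"],
        check=True, capture_output=True, text=True,
    ).stdout
    return job_states(out)
```

For example, explain(query(job_id)[str(job_id)]) gives a one-line hint for the top-level job record; the log named by --output remains the first place to look.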
3. A full nnU-Net training run
A full nnU-Net run has three steps: plan and preprocess, train, then
infer. (Preparing the dataset into nnU-Net's folder format is its own
task; for the workshop we reuse the prepared CTSpinoPelvic1K dataset,
and the conversion script lives in the spinopelvic-seg repo.)
Step 1 — plan & preprocess (auto-configures the network to the data; run once per dataset):
nnUNetv2_plan_and_preprocess -d 1 --verify_dataset_integrity
Step 2 — train. With 5-fold cross-validation you train five models (folds 0–4); each is a separate, hours-long GPU job. Submit them as one array job so SLURM runs them in parallel as GPUs free up:
#SBATCH --array=0-4 # add this line to train.sh
# ...and change the work line to:
nnUNetv2_train 1 3d_fullres ${SLURM_ARRAY_TASK_ID}
sbatch train.sh now launches folds 0–4 together — that's
parallelism: five independent jobs, not one
five-times-longer job.
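If your own training code is a Python script rather than a one-line shell command, the same array trick applies: each array task reads its index from the SLURM_ARRAY_TASK_ID environment variable. The following is a minimal sketch under that assumption; fold_from_env and train_command are hypothetical helper names, and N_FOLDS = 5 mirrors the --array=0-4 range above.

```python
import os

N_FOLDS = 5  # matches #SBATCH --array=0-4


def fold_from_env(env=os.environ) -> int:
    """Read the fold index that SLURM assigned to this array task."""
    raw = env.get("SLURM_ARRAY_TASK_ID")
    if raw is None:
        raise RuntimeError(
            "SLURM_ARRAY_TASK_ID is not set; submit with --array=0-4"
        )
    fold = int(raw)
    if not 0 <= fold < N_FOLDS:
        raise ValueError(f"fold {fold} is outside 0..{N_FOLDS - 1}")
    return fold


def train_command(fold: int, dataset: int = 1,
                  config: str = "3d_fullres") -> list:
    """Build the nnUNetv2_train argument list for one fold."""
    return ["nnUNetv2_train", str(dataset), config, str(fold)]
```

Failing early on a missing or out-of-range index is deliberate: a mistyped --array range then stops the task at once instead of training the wrong fold for hours.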
Step 3 — inference with your freshly trained model
is exactly Tutorial 3 (nnUNetv2_predict), pointing
-d/-f at what you trained.
4. Reproducibility — the part reviewers check
"Reproducible" means someone else (or future-you) runs the same command and gets the same result. Five habits:
- Pin the environment. Record exact versions so the software can't drift:
    mamba env export > environment.yml # commit this file
- Fix the random seed. nnU-Net seeds its splits; for your own code, call np.random.seed(42) and torch.manual_seed(42). Same seed → same split → same starting point for training (some GPU operations are nondeterministic, so expect near-identical rather than bit-identical results).
- Freeze the data splits. Save splits_final.json (which cases are train/val/test) and version it, so train/test never silently change between runs.
- No data leakage. A case used in training must never appear in its own test evaluation. With cross-validation, score each case with the fold that did not train on it ("out-of-fold"); this is how our dataset's model-completed labels are made honest.
- Version control everything (scripts, configs, env file) in git, and log the exact command + job id. The set -euo pipefail and #SBATCH --output lines above are part of this: fail loudly, keep the log.
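Two of these habits are easy to check in code. The sketch below shows one way to seed your own scripts and to verify that frozen splits are leak-free. It assumes the split layout nnU-Net v2 writes to splits_final.json: a list with one entry per fold, each holding "train" and "val" lists of case ids. The function names are hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, and NumPy / PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def check_no_leakage(splits: list) -> None:
    """Raise if any fold trains and validates on the same case."""
    for i, fold in enumerate(splits):
        overlap = set(fold["train"]) & set(fold["val"])
        if overlap:
            raise ValueError(
                f"fold {i}: cases in both train and val: {sorted(overlap)}"
            )


def out_of_fold(splits: list) -> dict:
    """Map each case to the single fold whose model never trained on it."""
    held_out = {}
    for i, fold in enumerate(splits):
        for case in fold["val"]:
            if case in held_out:
                raise ValueError(
                    f"{case} is held out by folds {held_out[case]} and {i}"
                )
            held_out[case] = i
    return held_out
```

In practice splits would come from json.load on the versioned splits_final.json; out_of_fold(splits) then tells you which fold's model to use when scoring each case.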
5. From one GPU to a pipeline (Nextflow)
When a study is many steps × many cases (preprocess → train → infer →
score), hand-submitting sbatch jobs doesn't scale.
Nextflow (installed in Tutorial 1) describes the
pipeline once and submits all the SLURM jobs for you, re-running only
what changed (-resume) — reproducible and parallel by
construction. A minimal example:
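process predict {
    cpus 8 ; memory '64 GB'
    clusterOptions '--gres=gpu:1'        // prediction needs a GPU too
    input: path ct, stageAs: 'in/*'      // stage the CT into its own input folder
    output: path 'out/*.nii.gz'          // nnU-Net names each output after its case
    script: "nnUNetv2_predict -i in -o out -d 1 -c 3d_fullres -f 0"
}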
nextflow run pipeline.nf -profile slurm -resume
You don't need to master Nextflow today — just know it's how the full
CTSpinoPelvic1K pipeline (hundreds of cases, multiple
stages) is orchestrated on the grid.
Recap
- Never compute on the login node: sbatch a job that requests a GPU (--gres=gpu:1) and a sensible --time.
- nnU-Net training = plan/preprocess → train (5 folds) → infer; run the folds as a SLURM array for parallelism.
- Reproducibility = pinned env + fixed seeds + frozen splits + no leakage + version control. This is what turns "it worked on my machine" into a result.
Next: Writing an AI paper — turning a dataset and an evaluation into a publishable contribution.