
BioRLHF Training on Cayuga HPC

Cluster: Cornell Cayuga HPC
Target: GPU training with Mistral-7B + LoRA (SFT, DPO, GRPO)


Quick Start

# 1. SSH to Cayuga
ssh jak4013@cayuga-login1

# 2. Submit a GRPO training job
bash -l -c 'sbatch scripts/run_grpo_full.sh'

# 3. Monitor
squeue -u $USER
tail -f logs/grpo_full_*.log

Step 1: Transfer Files to HPC

From your local Mac:

rsync -avz --progress \
    /Users/jak4013/Dropbox/Bioinformatics/Claude/BioRLHF/biorlhf/ \
    jak4013@cayuga-login1:/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/

Step 2: Set Up Conda Environment (First Time Only)

# SSH to Cayuga
ssh jak4013@cayuga-login1

# Source conda (non-interactive shell requires explicit sourcing)
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh

# Create environment
conda create -n biorlhf python=3.10 -y
conda activate biorlhf

# Install PyTorch with CUDA support
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Install training dependencies
# (quote the version specifiers; unquoted, the shell treats ">" as an output redirect)
pip install "transformers>=4.36.0" "peft>=0.6.0" "trl>=0.14.0"
pip install "bitsandbytes>=0.41.0" "accelerate>=0.24.0" "datasets>=2.14.0"
pip install wandb scipy scikit-learn sentencepiece jsonlines

# Verify GPU access (on a GPU node)
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
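
Beyond the CUDA check, it can help to confirm that the installed packages meet the minimum versions requested above. The following is a minimal sketch, not part of the BioRLHF package: the helper names (parse_version, meets_minimum, check_environment) are hypothetical, and the comparison only looks at leading numeric components, so it ignores pre-release ordering.

```python
from importlib import metadata

# Minimum versions copied from the pip commands above.
MINIMUMS = {
    "transformers": "4.36.0",
    "peft": "0.6.0",
    "trl": "0.14.0",
    "bitsandbytes": "0.41.0",
    "accelerate": "0.24.0",
    "datasets": "2.14.0",
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    core = text.split("+", 1)[0]  # drop local build tags such as "+cu121"
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "4.36" and "4.36.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Map each package name to "ok", "missing", or "too old (<version>)"."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

Run it inside the activated biorlhf environment; any line that is not "ok" points at a package to reinstall.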

Step 3: Training Options

Option A: GRPO Training (Recommended)

Run GRPO with verifier-based multi-reward training, starting from an SFT checkpoint:

# Submit via SLURM (use login shell for correct sbatch version)
bash -l -c 'sbatch scripts/run_grpo_full.sh'

Key config (configs/grpo_full_v2.json):

  • G=16 generations per prompt
  • V1-V4 verifiers with weights [0.35, 0.30, 0.15, 0.20]
  • beta=0.02, 2 iterations per batch
  • ~48h on A40
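
To illustrate how the four verifier weights might enter the reward, here is a minimal sketch that assumes the per-verifier scores are combined as a weighted sum. That combination rule and the function name combine_rewards are assumptions made for illustration; the actual logic lives in the BioRLHF training code and may differ.

```python
# Weights for V1-V4 as listed in the key config above.
VERIFIER_WEIGHTS = [0.35, 0.30, 0.15, 0.20]


def combine_rewards(scores, weights=VERIFIER_WEIGHTS):
    """Weighted sum of per-verifier scores (assumed combination rule)."""
    if len(scores) != len(weights):
        raise ValueError("expected one score per verifier weight")
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Hypothetical scores for one completion, in V1..V4 order.
    example_scores = [1.0, 0.5, 0.0, 1.0]
    print(f"combined reward: {combine_rewards(example_scores):.2f}")
```

Because the listed weights sum to 1.0, a completion that scores 1.0 on every verifier would receive a combined reward of 1.0 under this assumption.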

Option B: SFT Training

# Interactive session
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash

# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf

# Run SFT
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-train --model mistralai/Mistral-7B-v0.3 --dataset data/kmp_sft_final.json --output ./my_sft_model

Option C: Interactive GPU Session

# Request GPU
srun -p scu-gpu --gres=gpu:1 --mem=48G -c 8 --time=4:00:00 --account=cayuga_0003 --pty bash

# Activate environment
. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf

# Navigate and run
cd /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF
biorlhf-grpo --config configs/grpo_full_v2.json

Step 4: Monitor Training

# Check job status
squeue -u $USER

# Tail logs
tail -f logs/grpo_full_*.log

# GPU usage (on compute node)
nvidia-smi

# WandB dashboard
# https://wandb.ai/jangkeun-weill-cornell-medicine/biogrpo

Environment Details

| Component    | Version     |
|--------------|-------------|
| Python       | 3.10        |
| PyTorch      | 2.5.1+cu121 |
| Transformers | 4.57.3      |
| TRL          | 0.26.2      |
| PEFT         | 0.18.0      |

GPU Options on Cayuga

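| GPU  | VRAM | Best For                          | SLURM Flag        |
|------|------|-----------------------------------|-------------------|
| A40  | 48GB | Standard GRPO/SFT with QLoRA      | --gres=gpu:1      |
| A100 | 80GB | Larger batches, faster training   | --gres=gpu:a100:1 |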

Important Notes

SLURM Version

The default sbatch at /usr/bin/sbatch is outdated (v22.05.2). Submit jobs with bash -l -c 'sbatch ...' so that the login shell loads the current version (slurm/25.05.0) via the module system.

Conda in Non-Interactive Shells

source ~/.bashrc does not work in non-interactive SSH. Always source conda directly:

. /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh
conda activate biorlhf

SFT Checkpoint Symlink

The SFT model adapter is stored at:

/athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final

GRPO scripts auto-symlink this into the working directory.
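
For readers who need to recreate the link by hand, the sketch below shows one way an idempotent symlink step could look. It is an illustration only, not the code used by the GRPO scripts; the function name ensure_symlink and the link name in the usage note are hypothetical.

```python
import os


def ensure_symlink(target, link_path):
    """Create link_path -> target unless a correct link already exists.

    Returns True if a link was created, False if it was already in place.
    """
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False  # already correct, nothing to do
        os.remove(link_path)  # stale link pointing elsewhere
    elif os.path.exists(link_path):
        # Refuse to overwrite a real file or directory.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return True
```

A call such as ensure_symlink(sft_checkpoint_path, "kmp_sft_model_final") from the working directory can be repeated safely, since a second call finds the existing link and does nothing.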

Batch Size with G=16

Both per_device_eval_batch_size and generation_batch_size must be divisible by num_generations. The TRL parameter is generation_batch_size, NOT per_device_generation_batch_size.
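
The divisibility rule can be checked before submitting a job. This is a minimal sketch that assumes the config is a flat dictionary using the three key names mentioned above; the function name check_generation_batch_sizes is hypothetical, and keys that are absent are simply skipped.

```python
def check_generation_batch_sizes(config):
    """Return a list of problems; an empty list means the sizes are compatible."""
    g = config["num_generations"]
    problems = []
    # Catch the misnamed parameter called out in the note above.
    if "per_device_generation_batch_size" in config:
        problems.append(
            "use generation_batch_size, not per_device_generation_batch_size"
        )
    for key in ("per_device_eval_batch_size", "generation_batch_size"):
        value = config.get(key)
        if value is None:
            continue  # not set in this config; nothing to check
        if value % g != 0:
            problems.append(
                f"{key}={value} is not divisible by num_generations={g}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical values for illustration.
    cfg = {
        "num_generations": 16,
        "per_device_eval_batch_size": 16,
        "generation_batch_size": 24,
    }
    for problem in check_generation_batch_sizes(cfg):
        print(problem)
```

With G=16, values such as 16, 32, or 48 pass the check, while 24 is reported.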

Eval Performance

GRPOTrainer's eval loop generates completions sequentially (~3 min/sample). With 107 eval samples, each eval pass takes ~5.3h. Set eval_steps=9999 to skip in-training eval; run post-hoc evaluation instead.
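
The eval-time estimate follows from simple arithmetic, which the short sketch below reproduces so it can be redone for a different eval set. The function name eval_pass_hours is hypothetical, and the 3 min/sample figure is the approximate rate quoted above.

```python
def eval_pass_hours(num_samples, minutes_per_sample):
    """Estimated wall-clock hours for one sequential eval pass."""
    return num_samples * minutes_per_sample / 60


if __name__ == "__main__":
    hours = eval_pass_hours(107, 3)
    print(f"one eval pass: {hours:.2f} h")
```

At 107 samples and roughly 3 minutes each, one pass comes to 321 minutes, which matches the figure of about 5.3 hours above; several in-training passes would add a substantial amount to the ~48h run.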


Troubleshooting

"CUDA out of memory"

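Reduce the per-step batch size in the config JSON and raise gradient accumulation to compensate. This lowers peak memory while keeping the effective batch size unchanged: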

{
    "batch_size": 1,
    "gradient_accumulation_steps": 16
}
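
To see why this trade leaves the optimization unchanged, the sketch below computes the effective batch size as batch size times accumulation steps times GPU count. The function name effective_batch_size and the "before" values are hypothetical; only the two config keys come from the JSON above.

```python
def effective_batch_size(config, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return config["batch_size"] * config["gradient_accumulation_steps"] * num_gpus


if __name__ == "__main__":
    before = {"batch_size": 2, "gradient_accumulation_steps": 8}   # hypothetical
    after = {"batch_size": 1, "gradient_accumulation_steps": 16}   # from the JSON above
    print(effective_batch_size(before), effective_batch_size(after))
```

Halving the batch size while doubling the accumulation steps keeps the product the same, so only the per-step memory footprint changes.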

"No GPU available"

nvidia-smi                    # Check that a GPU is visible on this node
squeue -u $USER               # Confirm the job is running and note its partition and node

LoRA adapter loading fails

The SFT checkpoint is a LoRA adapter, not a full model. Load the base model first, then attach the adapter:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")
model = PeftModel.from_pretrained(base, "path/to/kmp_sft_model_final")
model = model.merge_and_unload()  # Merge for GRPO training
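
Before loading, it can be useful to confirm that a checkpoint directory really is an adapter. The sketch below relies on the PEFT convention that a saved adapter directory contains an adapter_config.json file with a base_model_name_or_path field; the helper names are hypothetical and the check is a heuristic, not a validation of the weights.

```python
import json
import os


def is_peft_adapter_dir(path):
    """True when the directory contains a PEFT adapter config file."""
    return os.path.isfile(os.path.join(path, "adapter_config.json"))


def adapter_base_model(path):
    """Return the base model name recorded in the adapter config, or None."""
    config_file = os.path.join(path, "adapter_config.json")
    with open(config_file, encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("base_model_name_or_path")
```

If is_peft_adapter_dir returns True, adapter_base_model shows which base model the adapter expects, which should match the model passed to AutoModelForCausalLM.from_pretrained.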

Key Paths

| Path | Description |
|------|-------------|
| /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/BioRLHF/ | Working directory |
| /athena/cayuga_0003/scratch/users/jak4013/otsuka/training/biorlhf/kmp_sft_model_final | SFT checkpoint |
| /athena/cayuga_0003/scratch/users/jak4013/otsuka/data/ | Data directory |
| /home/fs01/jak4013/miniconda3/miniconda3/etc/profile.d/conda.sh | Conda init script |