llamacle_drgrpo_v1_step10: Dr. GRPO RL on top of llamacle_v6_clean

This checkpoint continues ceselder/llamacle_v6_clean_step1875 with online Dr. GRPO RL on the 2,500 held-out FineWeb LoRAs from the v6 pretrain. Each cycle uses 32 prompts with K=16 rollouts per prompt, lr=7e-6, and asymmetric clipping with eps_low/high=0.2/0.28. Training runs on an NF4-quantized base with DDP across 6 B200s; decode is sub-batched as K=4x4, and the backward pass uses forward_ckpt_inject.
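
For readers unfamiliar with the objective, the following is a minimal sketch of the two ingredients named above: group-relative advantages without standard-deviation normalization (the Dr. GRPO variant) and a clipped surrogate with separate lower and upper clip ranges. It is not the training code for this model. The function names `drgrpo_advantages` and `clipped_token_objective` are hypothetical, and the per-token form of the surrogate is an assumption; only the values eps_low=0.2 and eps_high=0.28 come from this card.

```python
import math

def drgrpo_advantages(rewards):
    """Advantages for the K rollouts of one prompt.

    Dr. GRPO style: subtract the group mean, but do not divide by the
    group standard deviation.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token (assumed form).

    The importance ratio is clipped to [1 - eps_low, 1 + eps_high], so
    the upper bound is looser than the lower one.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# Illustrative judge scores for four rollouts of one prompt (made up).
advantages = drgrpo_advantages([1.0, 2.0, 3.0, 6.0])
```

In this sketch a rollout scored above its group mean gets a positive advantage, and its token-level gain stops growing once the ratio exceeds 1 + eps_high.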

This is step 10 of 80 (early checkpoint).

Score progression (judge 1-10, mean over kept rollouts)

cycle 1: 4.08
cycle 2: 3.85
cycle 3: 4.45
cycle 4: 4.95
cycle 5: 4.95
cycle 6: 5.17
cycle 7: 5.77
cycle 8: 4.21
cycle 9: 4.82
cycle 10: 4.73
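
As a rough way to read the trend, the sketch below averages the first five and last five cycles and finds the best cycle. The scores are copied from the list above; the half-split is an arbitrary choice made here for illustration, not something the training run reports.

```python
# Per-cycle mean judge scores, copied from this card (cycles 1..10).
scores = [4.08, 3.85, 4.45, 4.95, 4.95, 5.17, 5.77, 4.21, 4.82, 4.73]

def mean(xs):
    return sum(xs) / len(xs)

first_half = mean(scores[:5])    # cycles 1-5
second_half = mean(scores[5:])   # cycles 6-10
# Cycles are 1-indexed, so shift the list index by one.
best_cycle = max(range(len(scores)), key=lambda i: scores[i]) + 1

print(f"cycles 1-5: {first_half:.2f}, cycles 6-10: {second_half:.2f}, "
      f"best cycle: {best_cycle}")
```

With only ten cycles the per-cycle means are noisy, so this comparison is indicative at best.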

Model tree for ceselder/llamacle_drgrpo_v1_step10

Base model: meta-llama/Llama-3.1-70B
Finetuned: meta-llama/Llama-3.3-70B-Instruct
Finetuned: this model