DL HW3 GRPO LoRA Adapter

This repository contains the final LoRA adapter for DL HW3: Reasoning LLM Step 3 with GRPO.

Base Model

Qwen/Qwen2.5-14B-Instruct

The model was initialized from my Step 2 SFT LoRA adapter and then further optimized with GRPO.

GRPO output directory: outputs/grpo_hw2best_balanced_30steps_lr2e8

The final GRPO run used a balanced version of HW2_.csv.

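Final inference is score-only: rather than generating an answer, it computes a log-softmax over the answer options A/B/C/D and selects the highest-scoring option.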
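The scoring step can be sketched as follows. This is a minimal illustration, not the actual inference script: it assumes the raw logits for the four answer letters have already been read from the model's next-token distribution at the answer position. The names `log_softmax`, `pick_answer`, and `example_logits` are hypothetical, and the numbers are made up.

```python
import math

LETTERS = ("A", "B", "C", "D")


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def pick_answer(letter_logits):
    """Return the best letter and the per-letter log-probabilities.

    letter_logits maps each of "A".."D" to its raw logit. How these logits
    are extracted from the model is an assumption and is not shown here.
    """
    logits = [letter_logits[letter] for letter in LETTERS]
    scores = dict(zip(LETTERS, log_softmax(logits)))
    best = max(LETTERS, key=lambda letter: scores[letter])
    return best, scores


# Made-up logits for illustration only.
example_logits = {"A": 1.2, "B": 3.4, "C": 0.3, "D": -0.5}
best, scores = pick_answer(example_logits)
print(best, {k: round(v, 3) for k, v in scores.items()})
```

Restricting the softmax to the four letters keeps the comparison independent of how much probability mass the model places on other tokens, and no text has to be generated or parsed.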

Public leaderboard score: approximately 0.71
