Project: Post-Training a pre-trained model to be able to answer closed-book factual questions in the domain of general knowledge.

Format requirement: The model should enclose its final answer in \boxed{}.

Context: Modern Natural Language Processing course (EPFL, CS-552).

Qwen3-1.7B post-trained using RLVR with GRPO on MMLU training set (3000 samples).

Used LoRA with following config:

lora_config = LoraConfig(
    r = 16, 
    lora_alpha = 32,
    lora_dropout = 0.0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

Training arguments:

training_args = GRPOConfig(
    output_dir = OUTPUT_DIR,
    learning_rate = 1e-6,
    num_generations = N_GROUP,
    max_steps = 1500,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 8,
    max_completion_length = 3090,
    use_vllm = True,
    vllm_gpu_memory_utilization = 0.55,
    #vllm_max_model_len = 4096,
)

Finally:

merged_model.generation_config = GenerationConfig(
    bos_token_id=tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.eos_token_id,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    temperature=0.6,
    top_k=20,
    top_p=0.95,
)

RESULTS

For all results here after, 4K tokens allowed at inference.

Model Accuracy (%)
(Internal Benchmark)
Missing \boxed{} (%)
(Internal Benchmark)
Pre-trained Qwen3-1.7B 68.94 1.0
GRPO post-trained model 64.68% 8.9%
Downloads last month
18
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for JoanneJegou/GRPO_post_trained_v1

Finetuned
Qwen/Qwen3-1.7B
Finetuned
(805)
this model

Collection including JoanneJegou/GRPO_post_trained_v1