cs224r-default-project-rloo

RLOO (REINFORCE Leave-One-Out) fine-tuned model for the Countdown arithmetic reasoning task, built on top of an SFT baseline. Trained as part of Stanford CS224R (Spring 2026).

Model Description

This model is trained with online reinforcement learning using the RLOO algorithm. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier rewards correct arithmetic equations (score 1.0), correctly formatted but incorrect equations (score 0.1), and malformed outputs (score 0.0).
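The three-level reward described above can be sketched as a small function. This is a hypothetical illustration, not the project's verifier: the name `countdown_reward`, the exact tag pattern, the rule that every allowed number must be used exactly once, and the handling of a trailing `= target` are all assumptions.

```python
import ast
import re

# Assumed output layout: <think>...</think> followed by <answer>...</answer>.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)
# Only plain arithmetic is accepted; names, calls, and powers are rejected.
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)


def countdown_reward(completion, numbers, target):
    """Return 1.0 (correct), 0.1 (well-formed but wrong), or 0.0 (malformed)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    # Assumption: an answer written as "lhs = rhs" is judged by its left side.
    expr = match.group(1).split("=")[0].strip()
    try:
        tree = ast.parse(expr, mode="eval")
    except (SyntaxError, ValueError):
        return 0.0
    nodes = list(ast.walk(tree))
    used = [n.value for n in nodes if isinstance(n, ast.Constant)]
    if not all(isinstance(n, ALLOWED) for n in nodes):
        return 0.0
    if any(type(v) is not int for v in used):
        return 0.0
    # Assumption: each allowed number must appear exactly once.
    if sorted(used) != sorted(numbers):
        return 0.1
    try:
        value = eval(compile(tree, "<answer>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return 0.1
    return 1.0 if abs(value - target) < 1e-6 else 0.1
```

For instance, with numbers `[3, 4, 6, 8]` and target 24, an answer of `3 * (6 + 8 / 4)` would score 1.0 under these assumptions, while `3 + 4 + 6 + 8` would score 0.1.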

Training Details

| Hyperparameter | Value |
| --- | --- |
| Base model | ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B) |
| Algorithm | RLOO (REINFORCE Leave-One-Out) |
| Dataset | asingh15/countdown_tasks_3to4 |
| Learning rate | 1e-5 (constant schedule) |
| Batch size | 128 (gradient accumulation = 128) |
| Group size (K) | 8 |
| Entropy coefficient | 0.001 |
| KL divergence coefficient | 0.001 |
| Importance weighting | Disabled |
| Weight decay | 1e-4 |
| Gradient clipping | 1.0 |
| Temperature | 1.0 |
| Max completion length | 1024 |
| Training steps | 100 |
| Precision | bfloat16 |
| Hardware | 1x NVIDIA H100 (Modal) |

Evaluation

The model was evaluated on the test split of asingh15/countdown_tasks_3to4 (50 prompts) using vLLM, sampling K=16 responses per prompt (800 responses in total) at temperature 0.6, top-k 20, and top-p 0.95.

| Metric | SFT Baseline | IPO | RLOO (this model) |
| --- | --- | --- | --- |
| Average Score | 0.3660 | 0.4080 | 0.6407 |
| Pass@1 | 0.30 | 0.375 | 0.6407 |
| Pass@16 | 0.75 (30/40) | 0.75 (30/40) | 0.78 (39/50) |
| Correct (score=1.0) | 244/800 | 287/800 | 491/800 |
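As a rough guide to reading the table, the sketch below shows one common way to aggregate per-sample verifier scores into an average score, a correct count, and a pass@K rate. The function name `summarize` and the definitions are assumptions: pass@K is taken here to mean the fraction of prompts with at least one fully correct response among its K samples, which may differ from how the reported numbers were computed. Pass@1 is left out because the card does not state how it was estimated.

```python
def summarize(scores_per_prompt, correct_score=1.0):
    """Aggregate verifier scores; expects one list of K scores per prompt."""
    flat = [s for scores in scores_per_prompt for s in scores]
    n_correct = sum(s == correct_score for s in flat)
    # A prompt counts as solved if any of its K samples is fully correct.
    solved = sum(
        any(s == correct_score for s in scores) for scores in scores_per_prompt
    )
    return {
        "average_score": sum(flat) / len(flat),
        "correct": n_correct,
        "samples": len(flat),
        "pass_at_k": solved / len(scores_per_prompt),
    }


# Toy example: 2 prompts with K = 2 samples each (not real evaluation data).
toy = summarize([[1.0, 0.1], [0.0, 0.1]])
```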

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the RLOO checkpoint and its tokenizer from the Hugging Face Hub.
model_id = "ba144220/cs224r-default-project-rloo"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format a Countdown prompt with the model's chat template.
messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the same decoding settings used for evaluation.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Limitations

  • Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
  • Performance degrades on harder problems with more numbers or larger targets.
  • The 0.5B parameter size limits reasoning capacity compared to larger models.

Authors

Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.
