Instructions to use ba144220/cs224r-default-project-rloo with libraries and local apps.

Libraries

How to use ba144220/cs224r-default-project-rloo with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ba144220/cs224r-default-project-rloo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-rloo")
model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-rloo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use ba144220/cs224r-default-project-rloo with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ba144220/cs224r-default-project-rloo"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ba144220/cs224r-default-project-rloo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
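
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the default URL assumes the vLLM server started above is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="ba144220/cs224r-default-project-rloo"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` should return the reply text. For an SGLang server on port 30000, pass `base_url="http://localhost:30000"`.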

Use Docker

docker model run hf.co/ba144220/cs224r-default-project-rloo

SGLang

How to use ba144220/cs224r-default-project-rloo with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ba144220/cs224r-default-project-rloo" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ba144220/cs224r-default-project-rloo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ba144220/cs224r-default-project-rloo" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ba144220/cs224r-default-project-rloo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


cs224r-default-project-rloo

RLOO (REINFORCE Leave-One-Out) fine-tuned model for the Countdown arithmetic reasoning task, built on top of an SFT baseline. Trained as part of Stanford CS224R (Spring 2026).

Model Description

This model is trained with online reinforcement learning using the RLOO algorithm. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier assigns a score of 1.0 to a correct arithmetic equation, 0.1 to a correctly formatted but incorrect equation, and 0.0 to a malformed output.
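
The three-level scoring can be illustrated with a small rule-based checker. This is a sketch under stated assumptions, not the verifier used in training: it assumes an equation is correct only if it uses every allowed number exactly once with `+`, `-`, `*`, `/` and evaluates to the target, and that any parseable equation inside <answer> tags counts as correctly formatted. The function name `score_response` is hypothetical.

```python
import ast
import operator
import re

# Binary operators allowed inside an equation (an assumption of this sketch).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording each literal in `used`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def score_response(response, numbers, target):
    """Return 1.0 (correct), 0.1 (well formed but wrong), or 0.0 (malformed)."""
    match = _ANSWER_RE.search(response)
    if match is None:
        return 0.0
    # Keep only the left-hand side if the model wrote "expr = target".
    expression = match.group(1).split("=")[0].strip()
    used = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    # Assumed rule: every allowed number must appear exactly once.
    uses_allowed = sorted(used) == sorted(numbers)
    if uses_allowed and abs(value - target) < 1e-6:
        return 1.0
    return 0.1
```

For example, with numbers [3, 4, 6, 8] and target 24, the answer `(8 + 4) * (6 / 3)` scores 1.0, `3 + 4 + 6 + 8` scores 0.1, and a response without <answer> tags scores 0.0.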

Training Details

Hyperparameter	Value
Base model	ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B)
Algorithm	RLOO (REINFORCE Leave-One-Out)
Dataset	asingh15/countdown_tasks_3to4
Learning rate	1e-5 (constant schedule)
Batch size	128 (gradient accumulation = 128)
Group size (K)	8
Entropy coefficient	0.001
KL divergence coefficient	0.001
Importance weighting	Disabled
Weight decay	1e-4
Gradient clipping	1.0
Temperature	1.0
Max completion length	1024
Training steps	100
Precision	bfloat16
Hardware	1x NVIDIA H100 (Modal)
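
For readers unfamiliar with RLOO, the leave-one-out baseline can be sketched as follows. This is an illustration of the general technique, not the training code for this model: each of the K responses sampled for a prompt is compared against the mean reward of the other K - 1 responses. The function name `rloo_advantages` is hypothetical, and the entropy and KL terms listed in the table are omitted.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's group of K sampled rewards."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other K - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


# Example group with K = 8 verifier scores for a single prompt.
group = [1.0, 0.1, 0.0, 0.0, 1.0, 0.1, 0.0, 0.0]
advantages = rloo_advantages(group)
```

Because each baseline excludes the sample it is applied to, the advantages within a group sum to zero, and a group in which every response receives the same reward contributes no gradient signal.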

Evaluation

Evaluated on the asingh15/countdown_tasks_3to4 test split (50 prompts) using vLLM with temperature 0.6, top-k 20, and top-p 0.95, sampling K=16 responses per prompt (800 responses per model).

Metric	SFT Baseline	IPO	RLOO (this model)
Average Score	0.3660	0.4080	0.6407
Pass@1	0.30	0.375	0.6407
Pass@16	0.75 (30/40)	0.75 (30/40)	0.78 (39/50)
Correct (score=1.0)	244/800	287/800	491/800
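
One way metrics of this kind can be computed from a grid of verifier scores is sketched below. The definitions are assumptions, since the card does not spell them out: average score is the mean over all sampled responses, Pass@1 is the fraction of sampled responses that are correct, and Pass@K is the fraction of prompts with at least one correct response among the K samples. The function name `summarize` and the toy scores are hypothetical.

```python
def summarize(scores):
    """scores[i][j] is the verifier score of response j to prompt i."""
    n_responses = sum(len(row) for row in scores)
    n_correct = sum(s == 1.0 for row in scores for s in row)
    return {
        "average_score": sum(sum(row) for row in scores) / n_responses,
        "pass_at_1": n_correct / n_responses,
        "pass_at_k": sum(any(s == 1.0 for s in row) for row in scores) / len(scores),
        "correct": n_correct,
    }


# Toy data: 3 prompts with K = 2 responses each (not results from this model).
toy_scores = [[1.0, 0.1], [0.0, 0.1], [1.0, 1.0]]
metrics = summarize(toy_scores)
```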

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the RLOO checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-rloo")
tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-rloo")

# Format a Countdown prompt with the model's chat template.
messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Sample with the decoding settings used in the evaluation (temperature 0.6, top-k 20, top-p 0.95).
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Limitations

Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
Performance degrades on harder problems with more numbers or larger targets.
The 0.5B parameter size limits reasoning capacity compared to larger models.

Authors

Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.
