cs224r-default-project-rloo
RLOO (REINFORCE Leave-One-Out) fine-tuned model for the Countdown arithmetic reasoning task, built on top of an SFT baseline. Trained as part of Stanford CS224R (Spring 2026).
Model Description
This model is trained with online reinforcement learning using the RLOO algorithm. Given a target number and a set of allowed numbers, the model produces chain-of-thought reasoning inside <think> tags and a final answer inside <answer> tags. A rule-based verifier assigns a score of 1.0 to a correct arithmetic equation, 0.1 to a correctly formatted but incorrect equation, and 0.0 to a malformed output.
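The three-level reward described above can be sketched as follows. This is a minimal illustration, not the project's actual verifier: the function name `countdown_reward`, the rule that every allowed number must be used exactly once, and the choice to score unparseable expressions (including division by zero) as 0.0 are all assumptions.

```python
import ast
import operator
import re

# Binary operators the sketch accepts (assumption: + - * / only).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a parsed expression made only of integers and + - * /."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target):
    """Hypothetical verifier: 1.0 correct, 0.1 well-formed but wrong, 0.0 malformed."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no <answer> block at all
    # Assumption: if the model writes "expr = target", only the left side is checked.
    expr = match.group(1).split("=")[0].strip()
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):
        return 0.0  # characters outside digits, operators, parentheses, spaces
    try:
        used = sorted(int(n) for n in re.findall(r"\d+", expr))
        value = _eval(ast.parse(expr, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # assumption: unparseable expressions count as malformed
    if used == sorted(numbers) and abs(value - target) < 1e-6:
        return 1.0
    return 0.1
```

For example, with numbers `[3, 4, 6, 8]` and target `24`, the answer `3 * 4 * (8 - 6)` would score 1.0 under this sketch, while `3 + 4 + 6 + 8` would score 0.1.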
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | ba144220/cs224r-default-project-sft (SFT-tuned Qwen2.5-0.5B) |
| Algorithm | RLOO (REINFORCE Leave-One-Out) |
| Dataset | asingh15/countdown_tasks_3to4 |
| Learning rate | 1e-5 (constant schedule) |
| Batch size | 128 (gradient accumulation = 128) |
| Group size (K) | 8 |
| Entropy coefficient | 0.001 |
| KL divergence coefficient | 0.001 |
| Importance weighting | Disabled |
| Weight decay | 1e-4 |
| Gradient clipping | 1.0 |
| Temperature | 1.0 |
| Max completion length | 1024 |
| Training steps | 100 |
| Precision | bfloat16 |
| Hardware | 1x NVIDIA H100 (Modal) |
Evaluation
Evaluated on asingh15/countdown_tasks_3to4 test split (50 prompts) using vLLM with temperature 0.6, top-k 20, top-p 0.95, sampling K=16 responses per prompt.
| Metric | SFT Baseline | IPO | RLOO (this model) |
|---|---|---|---|
| Average Score | 0.3660 | 0.4080 | 0.6407 |
| Pass@1 | 0.30 | 0.375 | 0.6407 |
| Pass@16 | 0.75 (30/40) | 0.75 (30/40) | 0.78 (39/50) |
| Correct (score=1.0) | 244/800 | 287/800 | 491/800 |
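The card does not state how these metrics are computed. The sketch below shows one plausible set of definitions, all of which are assumptions: average score as the mean verifier score over all samples, Pass@1 as the fraction of samples scoring 1.0, and Pass@K as the fraction of prompts with at least one sample scoring 1.0. The function name `summarize_scores` and the toy inputs are hypothetical.

```python
def summarize_scores(scores_per_prompt):
    """Aggregate verifier scores; one inner list of K scores per prompt."""
    all_scores = [s for scores in scores_per_prompt for s in scores]
    n_correct = sum(1 for s in all_scores if s == 1.0)
    n_solved = sum(
        1 for scores in scores_per_prompt if any(s == 1.0 for s in scores)
    )
    return {
        "average_score": sum(all_scores) / len(all_scores),
        "pass_at_1": n_correct / len(all_scores),      # assumed: per-sample accuracy
        "pass_at_k": n_solved / len(scores_per_prompt),  # assumed: any-of-K success
        "n_correct": n_correct,
    }


# Toy input: 2 prompts with K = 2 samples each (not data from this model).
print(summarize_scores([[1.0, 0.1], [0.0, 0.1]]))
```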
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ba144220/cs224r-default-project-rloo")
tokenizer = AutoTokenizer.from_pretrained("ba144220/cs224r-default-project-rloo")

messages = [{"role": "user", "content": "Using the numbers [3, 4, 6, 8], create an equation that equals 24."}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_k=20, top_p=0.95, do_sample=True)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Limitations
- Trained and evaluated only on the Countdown arithmetic task; not intended for general-purpose use.
- Performance degrades on harder problems with more numbers or larger targets.
- The 0.5B parameter size limits reasoning capacity compared to larger models.
Authors
Yuchi Hsu (yuchihsu@stanford.edu) and Ryan He (ryanhe@stanford.edu), Stanford CS224R Spring 2026.