Instructions for using leon2k2k2k/qwen2.5-3b-countdown-sft-grpo with libraries and local apps.

Libraries

How to use leon2k2k2k/qwen2.5-3b-countdown-sft-grpo with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Qwen2.5-3B is a text-only causal language model, so use AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
model = AutoModelForCausalLM.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use leon2k2k2k/qwen2.5-3b-countdown-sft-grpo with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/leon2k2k2k/qwen2.5-3b-countdown-sft-grpo

SGLang

How to use leon2k2k2k/qwen2.5-3b-countdown-sft-grpo with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leon2k2k2k/qwen2.5-3b-countdown-sft-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Qwen2.5-3B Countdown SFT-then-GRPO (iteration 300)

Qwen2.5-3B, first supervised fine-tuned (SFT) on correct multiplication solutions (countdown-mult-sft), then trained for 300 iterations with the same GRPO recipe as the companion GRPO-alone model.

The point of this run was to test whether seeding GRPO with SFT (to install multiplication first) beats GRPO alone. It does not. GRPO restores add/sub that SFT had forgotten (19% back to 75% pass@10), but the multiplication SFT installed is pruned back to 0%, and the rigid SFT template survives, collapsing output diversity to about two distinct answers per ten tries. Stacking them keeps neither half-model's strength.
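The diversity figure above (about two distinct answers per ten tries) can be measured with a few lines of code. The card does not say how distinctness was defined, so the sketch below assumes the simplest option: comparing answer strings after removing whitespace. The function names and the sample answers are hypothetical.

```python
from statistics import mean


def distinct_answers(samples):
    """Count distinct answers among samples, ignoring whitespace differences.

    Samples with no extractable answer are passed as None and skipped.
    """
    normalized = {"".join(s.split()) for s in samples if s is not None}
    return len(normalized)


def mean_distinct(per_problem_samples):
    """Average number of distinct answers per problem."""
    return mean(distinct_answers(samples) for samples in per_problem_samples)


# Hypothetical extracted answers for one problem, ten samples.
samples = ["(7-3)*6", "(7 - 3) * 6", "6*(7-3)"] + ["(7-3)*6"] * 7
print(distinct_answers(samples))  # 2
```

String comparison treats algebraically equivalent equations such as `(7-3)*6` and `6*(7-3)` as different answers; a stricter definition would need to canonicalize the expressions first.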

Full writeup: https://leon2k2k2k.github.io/blog/2026/grpo-sft-teaching-reasoning-through-arithmetic/

Companion: GRPO-alone model | SFT dataset

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
tok = AutoTokenizer.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")

The model expects the Countdown prompt format: reason inside <think> </think>, give the final equation inside <answer> </answer>.
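A completion in this format can be checked mechanically. The sketch below is an assumption about how such a check might look, not the author's evaluation code: the helper names (`extract_answer`, `is_correct`) and the sample completion are hypothetical, and the rule that every given number is used exactly once is the usual Countdown convention rather than something stated on this card.

```python
import ast
import operator
import re
from collections import Counter

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Only the four basic arithmetic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_answer(completion):
    """Return the text of the last <answer>...</answer> block, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording each integer literal."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_correct(equation, numbers, target):
    """True if the equation uses each given number exactly once and hits the target."""
    expression = equation.split("=")[0]  # tolerate a trailing "= 24"
    used = []
    try:
        value = _evaluate(ast.parse(expression.strip(), mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and abs(value - target) < 1e-9


# Hypothetical completion for the numbers [3, 6, 7] and target 24.
completion = "<think>7 - 3 = 4, 4 * 6 = 24</think>\n<answer>(7 - 3) * 6</answer>"
print(is_correct(extract_answer(completion), [3, 6, 7], 24))  # True
```

Parsing with `ast` rather than calling `eval` keeps the check limited to integer literals and the four operators, so arbitrary model output is never executed.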

Results

300 held-out problems (150 add/sub, 150 needs-mult), 10 samples per problem at temperature 0.7.

| Problem type | pass@1 | pass@10 |
|---|---|---|
| add/sub, 3 numbers | 87.0% | 89.4% |
| add/sub, 4 numbers | 43.8% | 51.8% |
| needs-mult, 3 numbers | 0.0% | 0.0% |
| needs-mult, 4 numbers | 0.0% | 0.0% |

Compared with GRPO-alone, this model is a touch ahead at a single sample (71% vs 67% add/sub pass@1) but stalls with more tries (75% vs 94% add/sub pass@10): it is committed rather than exploratory.
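The card does not say how pass@1 and pass@10 were computed from the 10 samples per problem. One common choice is the unbiased pass@k estimator, sketched here as an assumption. With n = 10 samples it reduces to the fraction of correct samples at k = 1 and to "at least one sample correct" at k = 10. The function names and the toy counts are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n drawn samples were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems; correct_counts[i] is the number of
    correct samples (out of n) for problem i."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy data: correct samples out of 10 for four hypothetical problems.
counts = [10, 2, 0, 0]
print(mean_pass_at_k(counts, n=10, k=1))
print(mean_pass_at_k(counts, n=10, k=10))
```

Under this estimator, a model whose per-problem correct counts cluster at 0 or 10 gains little between k = 1 and k = 10, which is the pattern a committed, low-diversity sampler produces.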

Training

Two stages, both on one H100. (1) SFT on ~5,000 worked multiplication solutions. (2) GRPO via nano-aha-moment, starting from the SFT checkpoint: group size G = 4, learning rate 1e-6, KL coefficient 0.001, sampling temperature 1.0, 1024-token budget, 300 iterations. Reward: 1.0 for a well-formed response plus 1.0 for a correct answer.
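The reward and group size above can be sketched as follows. This is a minimal illustration under stated assumptions, not nano-aha-moment's actual code: the function names are hypothetical, the two reward terms are assumed to be awarded independently, and the advantage uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation), which the card does not spell out. The population standard deviation is used here for simplicity.

```python
from statistics import mean, pstdev


def countdown_reward(well_formed, correct):
    """1.0 for a well-formed response plus 1.0 for a correct equation."""
    return 1.0 * bool(well_formed) + 1.0 * bool(correct)


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of G samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 sampled responses to the same problem (made-up outcomes).
rewards = [
    countdown_reward(True, True),    # 2.0
    countdown_reward(True, False),   # 1.0
    countdown_reward(True, False),   # 1.0
    countdown_reward(False, False),  # 0.0
]
print(group_advantages(rewards))
```

When all G responses in a group receive the same reward, every advantage is zero and that group contributes no policy-gradient signal under this normalization.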

License and attribution

This is a fine-tune of Qwen2.5-3B, a base model by the Qwen team, and is released under the same Qwen Research License. The base model and its weights are their work; this repo adds only SFT followed by GRPO fine-tuning on Countdown.

Safetensors

Model size: 3B params
Tensor type: BF16

Model tree for leon2k2k2k/qwen2.5-3b-countdown-sft-grpo

Base model: Qwen/Qwen2.5-3B
