Qwen2.5-3B Countdown SFT-then-GRPO (iteration 300)
Qwen2.5-3B, first supervised fine-tuned (SFT) on correct multiplication solutions (countdown-mult-sft), then trained for 300 iterations with the same GRPO recipe as the companion GRPO-alone model.
The point of this run was to test whether seeding GRPO with SFT (to install multiplication first) beats GRPO alone. It does not. GRPO restores the add/sub ability that SFT had forgotten (add/sub pass@10 recovers from 19% to 75%), but the multiplication that SFT installed is pruned back to 0%, and the rigid SFT template survives, collapsing output diversity to about two distinct answers per ten tries. Stacking the two stages keeps neither half-model's strength.
Full writeup: https://leon2k2k2k.github.io/blog/2026/grpo-sft-teaching-reasoning-through-arithmetic/
Companion: GRPO-alone model | SFT dataset
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
tok = AutoTokenizer.from_pretrained("leon2k2k2k/qwen2.5-3b-countdown-sft-grpo")
```
The model expects the Countdown prompt format: reason inside `<think> </think>`, give the final equation inside `<answer> </answer>`.
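To score a completion in this format, you need to pull the equation out of the `<answer>` tags and check it against the puzzle. The sketch below is one way to do that, not the evaluation code behind this card. The helper names `extract_answer` and `check_equation` are hypothetical, and it assumes the usual Countdown rules: each given number is used exactly once, only `+ - * /` are allowed, and any `= ...` suffix in the answer is ignored.

```python
import ast
import operator
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Only the four basic arithmetic operators are accepted (assumption).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_answer(completion):
    """Return the text of the last <answer>...</answer> block, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording the integers it uses."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def check_equation(equation, numbers, target):
    """True if the equation uses each number exactly once and hits the target."""
    expr = equation.split("=")[0].strip()  # drop an optional "= target" suffix
    used = []
    try:
        value = _evaluate(ast.parse(expr, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(numbers) and abs(value - target) < 1e-9


completion = "<think>25 - 5 = 20, 20 * 4 = 80</think>\n<answer>(25 - 5) * 4</answer>"
equation = extract_answer(completion)
is_correct = equation is not None and check_equation(equation, [4, 5, 25], 80)
```

Parsing with `ast` instead of calling `eval` keeps arbitrary model output from being executed as Python.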
Results
300 held-out problems (150 add/sub, 150 needs-mult), 10 samples per problem at temperature 0.7.
| cell | pass@1 | pass@10 |
|---|---|---|
| add/sub, 3 numbers | 87.0% | 89.4% |
| add/sub, 4 numbers | 43.8% | 51.8% |
| needs-mult, 3 numbers | 0.0% | 0.0% |
| needs-mult, 4 numbers | 0.0% | 0.0% |
Compared with GRPO-alone, this model is a touch ahead at a single sample (71% vs 67% add/sub pass@1) but stalls with more tries (75% vs 94% add/sub pass@10): it is committed rather than exploratory.
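The card does not say how pass@1 and pass@10 were computed from the 10 samples per problem. A common choice, assumed here, is the unbiased pass@k estimator averaged over problems. With n = 10 samples it reduces to the fraction of correct samples for k = 1 and to "at least one sample correct" for k = 10. The function names and the example counts below are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples for each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts for four problems, 10 samples each (not data from this card).
correct_counts = [10, 3, 0, 0]
p1 = mean_pass_at_k(correct_counts, n=10, k=1)
p10 = mean_pass_at_k(correct_counts, n=10, k=10)
```

Under this estimator, a model that gives the same answer on nearly every sample has pass@10 close to pass@1, which is the pattern the comparison above describes.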
Training
Two stages, both on one H100. (1) SFT on ~5,000 worked multiplication solutions. (2) GRPO via nano-aha-moment, starting from the SFT checkpoint: group size G = 4, learning rate 1e-6, KL coefficient 0.001, sampling temperature 1.0, 1024-token generation budget, 300 iterations. Reward: 1.0 for a well-formed response plus 1.0 for a correct answer.
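As a rough illustration of how this reward feeds into GRPO with G = 4, the sketch below computes the two-part reward and then the group-relative advantages in their standard form (reward minus group mean, divided by group standard deviation). This is not code from nano-aha-moment: the function names, the use of the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def reward(well_formed, correct):
    """1.0 for a well-formed response plus 1.0 for a correct answer."""
    return float(well_formed) + float(correct)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of G = 4 samples for a single prompt.
group = [
    reward(well_formed=True, correct=True),    # 2.0
    reward(well_formed=True, correct=False),   # 1.0
    reward(well_formed=True, correct=False),   # 1.0
    reward(well_formed=False, correct=False),  # 0.0
]
advantages = group_advantages(group)

# A group in which every sample earns the same reward yields zero advantages.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With this normalization, a prompt on which all four samples earn the same reward contributes no advantage signal for that step.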
License and attribution
This is a fine-tune of Qwen2.5-3B by the Qwen team, and is released under the same Qwen Research License. The base model and its weights are their work; this repo only adds SFT then GRPO fine-tuning on Countdown.
Model tree for leon2k2k2k/qwen2.5-3b-countdown-sft-grpo
Base model: Qwen/Qwen2.5-3B