Instructions to use williyam/redrob-qwen-grpo with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use williyam/redrob-qwen-grpo with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="williyam/redrob-qwen-grpo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("williyam/redrob-qwen-grpo")
model = AutoModelForCausalLM.from_pretrained("williyam/redrob-qwen-grpo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use williyam/redrob-qwen-grpo with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "williyam/redrob-qwen-grpo"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "williyam/redrob-qwen-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/williyam/redrob-qwen-grpo

SGLang

How to use williyam/redrob-qwen-grpo with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "williyam/redrob-qwen-grpo" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "williyam/redrob-qwen-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "williyam/redrob-qwen-grpo" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "williyam/redrob-qwen-grpo",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use williyam/redrob-qwen-grpo with Docker Model Runner:
```
docker model run hf.co/williyam/redrob-qwen-grpo
```

redrob-qwen-grpo

Qwen/Qwen3-0.6B → GRPO-fine-tuned for explainable candidate ranking, under a rule-based reward model (no LLM-as-a-judge).

This is the open-source side-quest of the Talentry-AI submission to the Redrob × Hack2Skill — India Runs Data & AI Challenge.

The base Talentry-AI ranker is fully deterministic and runs with 0 LLM calls. This checkpoint exists for anyone who wants an LLM-flavoured candidate ranker that has been trained against the same rule-based rubric Talentry-AI uses to audit its own decisions. The Talentry-AI submission itself does not depend on this model.

Headline results

Metric	Baseline (`Qwen/Qwen3-0.6B`)	`redrob-qwen-grpo`	Δ
Mean rule-based reward `[0,1]`	0.539	0.713	+0.173
Eval episodes	12	12	—
Hardware	Apple M1 Pro 16 GB · MPS	Apple M1 Pro 16 GB · MPS	—
Eval `max_new_tokens`	384	384	—

The same deterministic eval rollout (seed=0, sequential, identical prompts) is used for both rows so the comparison is fair.

Per-component improvement (rule-based reward, mean over eval episodes)

Reward component	Baseline	Trained	Δ
`format_valid`	0.833	1.000	+0.167
`decision_match`	0.500	0.500	+0.000
`score_alignment`	0.373	0.653	+0.280
`reason_quality`	0.000	0.778	+0.778
`length_penalty`	1.000	1.000	+0.000
`no_hallucination`	0.779	0.656	-0.124
`total`	0.539	0.713	+0.173

All components are in [0, 1]. total is the weighted convex combination (see reward.py).

Quick usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("williyam/redrob-qwen-grpo")
mdl = AutoModelForCausalLM.from_pretrained(
    "williyam/redrob-qwen-grpo", dtype=torch.float32
).eval()

system = (
    "You are RedRob, an explainable candidate-ranking assistant. "
    "Decide whether the candidate should be SHORTLISTED for the role. "
    "Respond with a single JSON object: "
    '{"decision":"shortlist"|"reject","score":0..1,"reasons":[..]}.'
)
user = (
    "[JOB DESCRIPTION]\n<your JD here>\n\n"
    "[CANDIDATE]\n<candidate profile>"
)

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt")
out = mdl.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The model is expected to return:

{
  "decision": "shortlist" | "reject",
  "score": 0.0-1.0,
  "reasons": ["short, grounded bullet", "..."]
}

Training summary

Aspect	Value
Base model	`Qwen/Qwen3-0.6B` (600M params, Qwen3 chat template)
Algorithm	GRPO (TRL `GRPOTrainer`)
Reward signal	Rule-based (no LLM judge): six interpretable components
Reward components	`format_valid`, `decision_match`, `score_alignment`, `reason_quality`, `length_penalty`, `no_hallucination`
Optimiser steps	10 (deliberately short — sample-efficient demo on a laptop GPU)
`num_generations`	2 (group size; 2-arm advantage estimate)
KL coefficient `β`	0.04
Learning rate	5e-6
Sampling temperature / top-p	1.0 / 0.95
Max completion length	96 tokens (training); 512 tokens (eval, this card)
Hardware	Apple M1 Pro 16 GB · MPS (`bf16=False`, `fp16=False`, `fp32`)
Gradient checkpointing	Yes (`use_reentrant=False`)
Training wall-clock	~4.5 minutes for 10 steps

Full training config: configs/grpo_qwen3_0p6b.yaml.

Reward model (no LLM judge)

Every completion is graded by RuleBasedRewardModel on six components, each clipped to [0, 1]:

Component	What it measures
`format_valid`	Output parses as `{"decision","score","reasons"}` JSON.
`decision_match`	Matches gold `"shortlist" / "reject"` label.
`score_alignment`	`1 - │pred_score - gold_score│`.
`reason_quality`	2–5 short, diverse reasons that aren't copy-pasted from the input.
`length_penalty`	Stays inside a sensible character budget.
`no_hallucination`	Proper nouns / numbers in reasons all appear in the JD or candidate text.

Total reward = convex combination (weights documented in the dataclass), so total ∈ [0, 1].

Plots

The four training plots are committed to this repo and rendered inline below:

Training curves Baseline vs trained

Reward components Reward distribution

File	Description
`training_curves.png`	Mean reward `[0,1]` (left axis) + GRPO loss (right axis) vs train step.
`baseline_vs_trained.png`	Per-episode reward on the same eval rollout, baseline vs trained.
`reward_components.png`	Mean value of each rule-based reward component, baseline vs trained.
`reward_distribution.png`	Histogram of episode rewards across the eval rollout.

Intended use

Educational / research — show how GRPO with a rule-based reward shapes a small open-source LLM toward a structured JSON output schema for a real-world hiring-adjacent task.
Drop-in component — for anyone who wants to plug an LLM ranker into a candidate-shortlisting pipeline and get an auditable JSON {decision, score, reasons} response.
Reference implementation — the entire training loop, env, and reward model are open-source under MIT (source).

Out-of-scope / limitations

Not a substitute for human review. This model produces a score and reasons; final hiring decisions must always involve a human reviewer.
Trained on a 30-sample distilled fixture of the Redrob hackathon's candidate pool — it is not trained on the full 100K candidate population and will not generalise to arbitrary new JDs without fine-tuning on your own data.
Short training run (10 GRPO steps). The reward shapes can move meaningfully more with longer training; this checkpoint is the hackathon-submission burst, not a SOTA result.
Single-language (English).
Possible biases inherited from Qwen/Qwen3-0.6B's pre-training data and from the synthetic dataset of 50 Redrob candidates.
Honeypot resistance is provided by Talentry-AI's deterministic pipeline, not by this checkpoint — the LLM here cannot, by itself, detect "8 years at a 3-year-old company"-style impossibilities.

Citation

If you use this checkpoint, please cite:

@misc{redrob_qwen_grpo_2026,
  title  = {redrob-qwen-grpo: GRPO fine-tune of Qwen3-0.6B for explainable candidate ranking},
  author = {Williyam M},
  year   = {2026},
  url    = {https://huggingface.co/williyam/redrob-qwen-grpo},
  note   = {Open-source artifact from the Talentry-AI / Redrob × Hack2Skill - India Runs submission.}
}

License

MIT — see the Talentry-AI LICENSE.

Acknowledgements

Qwen/Qwen3-0.6B from the Qwen team.
trl for the GRPO implementation.
Redrob × Hack2Skill — India Runs for the JD + 50-candidate fixture.

Downloads last month: 115

Safetensors

Model size

0.6B params

Tensor type

F32

Model tree for williyam/redrob-qwen-grpo

Base model

Qwen/Qwen3-0.6B-Base

Finetuned

Qwen/Qwen3-0.6B