MoM-Python-SLM-GRPO (1.5B)

The spec-driven code-generation node of a Mixture-of-Models (MoM) mesh — the GRPO/RLVR-tuned successor to srivarenya/MoM-python-slm. Given a Python task (optionally with an upstream context packet), it returns reasoning followed by a function. It shares the Qwen2.5-Coder tokenizer with the other generative nodes, which is what makes logit-space fusion across the mesh valid.

  • Warm-started from: srivarenya/MoM-python-slm (DoRA r=64 SFT of Qwen2.5-Coder-1.5B-Instruct)
  • Method: GRPO (Group Relative Policy Optimization, RLVR) — a fresh DoRA r=64 adapter trained 500 steps, then merged.
  • Reward: 0.8 · execution + 0.1 · format + 0.1 · LLM-judge. Execution reward runs each completion against the problem's assert tests in a sandbox (binary pass/fail) — this is the load-bearing signal. Two-sided abstention: NEED_INPUT is rewarded only on underspecified prompts.
  • GRPO config: β=0 (no KL), asymmetric clip ε=[0.2, 0.25], G=8 completions/prompt, temp=0.9, top_p=0.95, lr=1e-5. Problems: 6k execution-verifiable (problem_solving + spec_to_code) + abstention records.

Benchmarks (greedy pass@1, same Colab/evalplus harness for all three)

Metric base SFT (MoM-python-slm) this model (GRPO) GRPO vs SFT
MBPP 66.7 69.6 72.5 +2.9
MBPP+ — — 62.7 —
domain problem_solving (exec) 0.700 0.713 0.767 +5.4
domain spec_to_code (exec) 0.632 0.714 0.729 +1.5
domain api_usage (application) — 0.855 0.900 +4.5
HumanEval 68.9 70.7 67.7 −3.0
HumanEval+ — — 62.2 —
domain api_signature (param-recall) 0.217 0.299 0.301 +0.0

What GRPO did (load-bearing read)

GRPO is a specialization trade, not a free lunch. Gains land on exactly the execution-rewarded, spec-driven dimensions — MBPP +2.9 and domain problem_solving +5.4 over SFT — while the un-reinforced HumanEval completion format gives back −3.0 (slightly under base). That's the textbook RLVR signature: the model sharpens "write a correct function from a spec" (what the MoM node actually does) at a small cost to "graft a body under a fixed signature" (a format it never saw a reward for).

  • Use this model for the spec-driven node role — it's the strongest on MBPP and the held-out domain eval.
  • Use the SFT sibling if HumanEval-completion is a hard gate — it remains the HumanEval-strongest checkpoint (70.7).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("srivarenya/MoM-python-slm-grpo")
model = AutoModelForCausalLM.from_pretrained(
    "srivarenya/MoM-python-slm-grpo", dtype="bfloat16", device_map="auto")

Prompt with the training system prompt + a Python task; the model returns reasoning then code. Reward, training recipe, and the self-contained GRPO Colab notebook are in the project repository.

Next cross-check: LiveCodeBench (contamination-resistant), before/after vs the SFT sibling.

Downloads last month
20
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for srivarenya/MoM-python-slm-grpo

Finetuned
(1)
this model