MoM-Python-SLM (1.5B)

The Python code-generation node of a Mixture-of-Models (MoM) mesh — a set of small, specialized Qwen2.5-Coder SLMs (shared tokenizer) coordinated by a lightweight router, aiming to beat frontier generalists on coding by specialization depth rather than parameter count.

This node is a single-turn code generator (not an agent): given a Python task (optionally with an upstream context packet), it returns reasoning followed by code. It shares the Qwen2.5-Coder tokenizer with the other generative nodes, which is what makes logit-space fusion across the mesh valid.

  • Base: Qwen/Qwen2.5-Coder-1.5B-Instruct
  • Method: DoRA r=64 (≈4.6% trainable), SFT (Phase A 1ep + Phase B 2ep), then merged.
  • Data: 476K instances (decontaminated vs HumanEval/MBPP, 0 overlap) built from the complete CPython docs + Flask/Requests source, issues/PRs, CVEs, and execution-verified synthetic problems.

Benchmarks (greedy pass@1)

Suite Metric base this model
HumanEval pass@1 68.9 70.7
MBPP pass@1 66.7 69.6
Domain (held-out) spec_to_code exec 0.632 0.714 (+8.2)
Domain (held-out) api_signature param-recall 0.217 0.299 (+8.2)
Domain (held-out) problem_solving exec 0.700 0.713 (parity)

The largest gains are on library/API capability (writing correct code from a spec, recalling API signatures) — the dimension HumanEval/MBPP are saturated on and can't measure. The repo's self-contained domain-eval notebook reproduces these.

Recipe findings (load-bearing)

  • Low DoRA rank wins: r=64 specializes without forgetting; r=256 catastrophically regressed (HumanEval 60.4 < base).
  • Moderate reasoning wins: the ~25%-reasoning recipe (this model) beat a 98%-reasoning sibling, whose HumanEval collapsed to 47 (always-reason prose fights the signature-completion format).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("srivarenya/MoM-python-slm")
model = AutoModelForCausalLM.from_pretrained(
    "srivarenya/MoM-python-slm", dtype="bfloat16", device_map="auto")

Prompt with the training system prompt + a Python task; the model returns reasoning then code.

Next step in the pipeline: GRPO/RLVR against an execution-grounded reward to push past the instruct-tuning ceiling. Code, training recipe, and eval harnesses: project repository.

Downloads last month
16
Safetensors
Model size
2B params
Tensor type
F32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for srivarenya/MoM-python-slm

Finetuned
(180)
this model