SPARK-Code · Condition C-reg2 (Regularized Co-Evolve, full pool, 6 iterations) · Qwen2.5-Coder-3B QLoRA
QLoRA adapter trained with regularized SPARK-style co-evolution on the full 311-problem MBPP pool over 6 iterations. A cautionary result: the regularized recipe delays but does not prevent drift over a longer schedule — HumanEval pass@1 regresses −2.2 pp and KL climbs to ~0.096 by iteration 6.
TL;DR
spark-code-C-reg2-3b extends the regularized co-evolve recipe (spark-code-C-reg-3b) to the full 311-problem MBPP pool and a 6-iteration schedule, with slightly stronger KL regularization (kl_coeff=0.03) and a smaller auxiliary loss (aux_loss_scale=0.02, pairwise weight dropped to 0.05). Unlike the shorter 3-iteration C-reg run — which matched the exec-only baseline within noise — this longer run drifts: HumanEval pass@1 falls from 0.796 to 0.774 (−2.2 pp), GRPO KL climbs steadily to ~0.096, and mean completion length contracts ~54%. The held-out MBPP pass@5 peaks early (0.71 at iter 3) and decays back to baseline by iter 6. The published weights are the completed iteration-6 state. This card documents a negative/cautionary result: more iterations are not better for the co-evolve recipe; the sweet spot here was ~iteration 3.
Training Setup
- Base model: Qwen/Qwen2.5-Coder-3B-Instruct
- Method: GRPO (exec-only reward, partial per-test scoring, frozen-reference KL) + auxiliary SFT phase per iteration. Auxiliary examples mined from each iteration's rollouts:
  - Pointwise: binary "Correct/Incorrect" judgments over a single candidate
  - Pairwise: randomized A/B preference between a passing and a failing rollout
  - Reflection: execution-grounded repair, target = a sibling correct rollout or the MBPP canonical solution (reflection_target_mode=correct_or_canonical)
- Training data: MBPP-sanitized, 311 problems (full pool), 6 iterations (completed), K=4 adaptive rollouts (up to 8), partial per-test rewards with syntax_penalty=-0.2, runtime_penalty=-0.1, timeout_penalty=-0.3.
- LoRA: r=16, alpha=32, dropout=0.05, targets q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
- Quantization: 4-bit NF4 + double quant, bf16 compute.
- Optimizer: AdamW, lr=5e-6, grad_accum=4, clip_ratio=0.2, max_grad_norm=1.0.
- GRPO KL: kl_coeff=0.03 against the frozen reference policy.
- Aux hyperparameters (this run): aux_loss_scale=0.02, aux_weight_pointwise=0.0, aux_weight_pairwise=0.05, aux_weight_reflection=1.0, aux_epochs=1, aux_max_len=1024.
- Aux pool sizes (after caps): iter 1 → 769 (pointwise 200 / pairwise 200 / reflection 369), iter 2 → 784 (200/200/384), iter 3 → 754 (200/200/354), iter 4 → 738 (200/198/340), iter 5 → 740 (200/200/340), iter 6 → 682 (200/178/304).
- Seed: 42.
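As a rough illustration of the partial per-test reward listed above, the sketch below averages a per-test score and falls back to a flat penalty when the candidate does not parse. Only the three penalty values come from this card; the function name, the outcome labels, and the rule for combining penalties with passes are assumptions, and the training script may differ.

```python
from typing import Sequence

# Penalty values quoted in the training setup.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3

# Hypothetical per-test outcome labels (not taken from the training script).
PER_TEST_SCORE = {
    "pass": 1.0,
    "fail": 0.0,  # ran to completion but the assertion failed
    "runtime_error": RUNTIME_PENALTY,
    "timeout": TIMEOUT_PENALTY,
}


def partial_reward(outcomes: Sequence[str], syntax_ok: bool = True) -> float:
    """Average per-test score; an unparseable candidate gets the syntax penalty."""
    if not syntax_ok:
        return SYNTAX_PENALTY
    if not outcomes:
        raise ValueError("need at least one test outcome")
    return sum(PER_TEST_SCORE[o] for o in outcomes) / len(outcomes)


print(partial_reward(["pass", "pass", "fail"]))  # two of three tests pass
print(partial_reward(["pass", "timeout"]))       # one pass, one timeout
print(partial_reward([], syntax_ok=False))       # candidate does not parse
```

Under this assumed scheme a rollout that passes some tests still earns a graded signal, which is the point of partial scoring compared with an all-or-nothing reward.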
Training script: run_experiment_with_mbpp_heldout.py in the GitHub repo.
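The pairwise auxiliary examples in the method description can be pictured as follows: one passing and one failing rollout for the same problem are placed in random A/B order so the label cannot be inferred from position. This is a minimal sketch; the function name, the prompt wording, and the dictionary keys are hypothetical and not taken from the repository.

```python
import random
from typing import Dict


def make_pairwise_example(problem: str, passing: str, failing: str,
                          rng: random.Random) -> Dict[str, str]:
    """Build one A/B preference example with a randomized slot for the passing rollout."""
    passing_is_a = rng.random() < 0.5
    cand_a, cand_b = (passing, failing) if passing_is_a else (failing, passing)
    # Hypothetical prompt layout, for illustration only.
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Candidate A:\n{cand_a}\n\n"
        f"Candidate B:\n{cand_b}\n\n"
        "Which candidate is correct? Answer A or B."
    )
    return {"prompt": prompt, "target": "A" if passing_is_a else "B"}


rng = random.Random(42)  # fixed seed for reproducibility in this sketch
ex = make_pairwise_example(
    "Return the sum of a list.",
    passing="def total(xs):\n    return sum(xs)",
    failing="def total(xs):\n    return len(xs)",
    rng=rng,
)
print(ex["target"])
```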
Evaluation Results
HumanEval is evaluated with 5 samples per problem at temperature=0.2, top_p=0.95. Held-out MBPP uses 100 problems disjoint from the training pool. "Reflection fix rate" is measured on the HumanEval held-out problems: for each failed first-pass generation the model is asked to repair its own code, and the fix is re-executed.
| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL | Refl. fix rate |
|---|---|---|---|---|---|---|---|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | — | — | 0.118 |
| 1 | 0.799 | 0.848 | 0.638 | 0.700 | 0.593 | 0.0003 | 0.061 |
| 2 | 0.788 | 0.829 | 0.634 | 0.690 | 0.603 | 0.0157 | 0.057 |
| 3 | 0.796 | 0.829 | 0.644 | 0.710 | 0.628 | 0.0485 | 0.061 |
| 4 | 0.788 | 0.817 | 0.626 | 0.660 | 0.646 | 0.0592 | 0.059 |
| 5 | 0.771 | 0.823 | 0.628 | 0.670 | 0.657 | 0.0822 | 0.081 |
| 6 | 0.774 | 0.823 | 0.632 | 0.680 | 0.696 | 0.0957 | 0.083 |
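The pass@1 and pass@5 columns can be read through the standard unbiased pass@k estimator, sketched below. The card does not say which estimator the evaluation code uses, so treat this as an assumption; the per-problem pass counts in the example are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem pass counts out of n=5 samples (not real eval data).
counts = [5, 3, 0, 1]
n = 5
p1 = sum(pass_at_k(n, c, 1) for c in counts) / len(counts)
p5 = sum(pass_at_k(n, c, 5) for c in counts) / len(counts)
print(p1, p5)
```

With 5 samples per problem, pass@1 reduces to the mean fraction of passing samples and pass@5 to the share of problems with at least one passing sample.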
Trajectory. HumanEval pass@1 holds near baseline through iter 3 (0.799 → 0.788 → 0.796), then declines through iters 4–6 to 0.774 (−2.2 pp). Held-out MBPP pass@5 peaks at iter 3 (0.71) and decays back to the baseline 0.68 by iter 6. Two drift signals grow monotonically and explain the regression:
- GRPO KL climbs steadily from 3e-4 (iter 1) to 0.096 (iter 6), roughly 40× the KL of the matched exec-only run (A-v2, KL ~0.0024 at iter 5). The auxiliary objective accumulates off-distribution pressure that the kl_coeff=0.03 term only partly contains.
- Completion length contracts ~54%: mean tokens per GRPO sequence fall from 185 (iter 1) to 86 (iter 6), the same shortening signature seen in the naive co-evolve run, here arriving more slowly.
The reflection fix rate stays low throughout (0.06–0.08), below the untrained baseline (0.118). Train pass rate keeps rising (0.593 → 0.696) even as held-out HumanEval falls — i.e. the model is fitting the training pool while losing cross-benchmark generalization.
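The headline drift figures follow directly from the reported numbers. The short calculation below reproduces them; all inputs are copied from the table and text above, and the variable names are just for this sketch.

```python
# Values copied from the results table and the trajectory notes above.
humaneval_pass1 = {0: 0.796, 6: 0.774}
grpo_kl_iter6 = 0.0957
exec_only_kl = 0.0024          # A-v2 reference value quoted in the text
mean_tokens = {1: 185, 6: 86}  # mean tokens per GRPO sequence

delta_pp = (humaneval_pass1[6] - humaneval_pass1[0]) * 100
kl_ratio = grpo_kl_iter6 / exec_only_kl
length_drop = 1 - mean_tokens[6] / mean_tokens[1]

print(f"HumanEval pass@1 change: {delta_pp:+.1f} pp")
print(f"KL ratio vs exec-only:   {kl_ratio:.0f}x")
print(f"Completion length drop:  {length_drop:.0%}")
```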
Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-C-reg2-3b")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Render the conversation to a prompt string with the tokenizer's chat template.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p as the HumanEval evaluation.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
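Instruction-tuned code models often wrap their answer in a markdown code fence even when asked for code only. If you want to execute the decoded completion, a small post-processing step like the one below may help. It is a stdlib-only sketch, not part of this repository, and the reply string stands in for real model output.

```python
import re

TICKS = "`" * 3  # a markdown code fence, built this way to avoid nesting fences here
_FENCE = re.compile(TICKS + r"[\w+-]*[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(completion: str) -> str:
    """Return the body of the first fenced block, or the whole text if none is found."""
    match = _FENCE.search(completion)
    return (match.group(1) if match else completion).strip()


# Stand-in for a decoded completion; real output will vary.
reply = f"Here is the function:\n{TICKS}python\ndef is_palindrome(s):\n    return s == s[::-1]\n{TICKS}"
code = extract_code(reply)

namespace = {}
exec(code, namespace)  # in real use, run model output only inside a sandbox
print(namespace["is_palindrome"]("level"))
```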
Comparison to Other Conditions
All five adapters share the same base model and seed. The original three (A, C-reg, C-light) used a 200-problem pool over 3 iterations; the two newer adapters (A-v2 and C-reg2) use the full 311-problem pool over 6 iterations. Each row reports that adapter's published checkpoint.
| Condition | Pool / iters | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---|---|---|---|
| C-reg2 (regularized, full) — this card | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |
| A-v2 (exec-only, full) | 311 / it 4 | 0.00 | 0.02 | 0.816 | 0.710 |
| A (exec-only) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| C-reg (regularized) | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.710 |
| C-light (naive) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
At its published checkpoint C-reg2 sits near the bottom on HumanEval pass@1 (0.774), effectively tied with the naive co-evolve run (C-light, 0.773): the longer schedule erased the stability advantage that the shorter C-reg run had.
Findings Summary
- Regularization delays drift; it does not prevent it. Over 6 iterations the regularized recipe still drifts (KL → 0.096, completion length −54%) and regresses on HumanEval (−2.2 pp). The shorter 3-iteration C-reg matched the baseline precisely because it stopped before the drift compounded.
- The auxiliary objective is the destabilizer, not the schedule. The matched exec-only run on the same 311-pool / 6-iteration budget (A-v2) stayed at KL ~0.0024 and reached the study's best HumanEval (0.816). Same data, same length, no aux → no drift.
- The operating sweet spot was ~iteration 3. Held-out MBPP pass@5 peaked at 0.71 (iter 3) and HumanEval was still at baseline there. An early-stopping or checkpoint-at-iter-3 policy would have captured the recipe's benefit without the later regression. The published iter-6 weights are the completed run, not the best checkpoint.
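The checkpoint-at-iter-3 observation can be expressed as a simple selection rule over the per-iteration metrics. The sketch below is one possible policy, not something the training script implements: the function name and the 0.005 tolerance are hypothetical, and the metrics are copied from the evaluation table. In practice the selection should use a validation split that is separate from the reported benchmarks.

```python
# Per-iteration metrics copied from the evaluation table above.
rows = [
    # (iter, humaneval_pass1, mbpp_held_pass5, grpo_kl)
    (1, 0.799, 0.700, 0.0003),
    (2, 0.788, 0.690, 0.0157),
    (3, 0.796, 0.710, 0.0485),
    (4, 0.788, 0.660, 0.0592),
    (5, 0.771, 0.670, 0.0822),
    (6, 0.774, 0.680, 0.0957),
]
BASELINE_HUMANEVAL = 0.796  # iteration 0


def pick_checkpoint(rows, baseline, tolerance=0.005):
    """Best held-out MBPP pass@5 among iterations whose HumanEval pass@1
    stays within `tolerance` of the baseline; None if no iteration qualifies."""
    eligible = [r for r in rows if r[1] >= baseline - tolerance]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[2])[0]


print(pick_checkpoint(rows, BASELINE_HUMANEVAL))
```

With these numbers only iterations 1 and 3 stay within tolerance of the baseline, and iteration 3 has the higher held-out MBPP pass@5.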
Related Artifacts
- Sibling adapters: spark-code-A-3b-v2 · spark-code-A-3b · spark-code-C-light-3b · spark-code-C-reg-3b
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval, held-out MBPP, and reflection JSONs, iters 0–6) lives under condition_C/eval/ in the repository
- Interactive demo Space: [SPACES_URL]
Citation
@misc{batjargal2026sparkcode,
title = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
author = {Amarsaikhan Batjargal},
year = {2026},
}
License
The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, Qwen/Qwen2.5-Coder-3B-Instruct, is distributed under the Tongyi Qianwen LICENSE; any downstream use must comply with its terms.