Qwen2.5-Coder-7B Agentic SLM v5 LoRA

This repository contains the PEFT LoRA adapter used in the v5 agentic coding system.

Base model: Qwen/Qwen2.5-Coder-7B-Instruct

The important result is not a raw first-answer-only model win. The v5 artifact is a small coding-agent system:

Run the 7B adapter through a strict code-only generation harness.
Execute tests/verifiers.
Select the shortest passing candidate.
On misses, invoke a stronger rescue model only for the failed tasks.
Verify again and report task-level results.

This is the correct interpretation of the release: the LoRA is one component of the system. The strongest score below comes from the full verifier-rescue pipeline.

Current Proof Gate

Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue

Evaluation set: 50 HumanEval/MBPP-style coding tasks used for fast iteration.

Phase	Greedy pass@1	Coverage@K	Selected@K	Repair	Final
Qwen2.5-Coder-7B reference harness	37/50	40/50	40/50	2/50	42/50
v5 7B adapter primary	37/50	42/50	42/50	2/50	44/50
14B rescue on primary misses	1/6	3/6	3/6	1/6	4/6
v5 combined rescue system	38/50	45/50	45/50	3/50	48/50

Lift

Against the Qwen2.5-Coder-7B reference harness result of 42/50:

LoRA-only primary system: 44/50, a +2/50 absolute improvement.
LoRA-only percentage-point lift: +4 points.
LoRA-only relative lift: +4.76%.
Full v5 rescue system: 48/50, a +6/50 absolute improvement.
Full system percentage-point lift: +12 points.
Full system relative lift: +14.29%.
Failure reduction: from 8 misses to 2 misses, a 75% reduction in failures on this gate.

The honest conclusion: the LoRA alone is a small gain. The meaningful progress is from the deterministic verifier/rescue system.

What This Is Not

This is not a claimed Claude Sonnet 4.5 replacement.

This is not a broad SWE-bench win.

This is not a proof that the raw 7B weights beat frontier models.

The release is a reproducible intermediate artifact: a compact coding model plus a verifier-oriented harness that shows a measurable improvement on a fast gate.

Required Next Benchmarks

The current gate is intentionally small. It is useful for fast iteration only. Before making larger claims, the next evaluation batch must include:

LiveCodeBench: fresh contest-style coding problems, preferably recent slices only.
BigCodeBench: broader function-level and library-use coding tasks.
SWE-bench Lite or Verified subset: repository patching with real tests.
Agentic edit tasks: file editing, test execution, patch generation, and repair loops.
Cost and latency: wall-clock time, tokens generated, GPU class, and estimated dollar cost.
Abstention rate: how often the system refuses to answer or returns no valid patch.
Invalid-output rate: markdown leakage, missing entrypoint, syntax errors, test leakage, and prose leakage.
Selector diagnostics: coverage@K, selected@K, selector gap, repair@1, and false-positive verifier selections.

Recommended Evaluation Policy

Do not push all training/eval/release work inside one notebook.

Use deterministic batches:

Baseline batch: run the base model first, no training.
Candidate batch: run the candidate model/harness on the exact same tasks.
Failure batch: collect failed tasks, failed code, verifier output, and minimal repair.
Repair batch: train or prompt only on verified repair data.
Proof batch: rerun held-out tests immediately.
Release batch: publish only if the proof gate beats the previous best.

Every batch should emit JSON summaries, task-level CSV, rollouts, error signatures, and environment metadata.

Files

adapter_model.safetensors: LoRA adapter.
adapter_config.json: PEFT configuration.
v5_rescue_release_summary.json: exact proof-run summary.
v5_rescue_eval_before_after_full_code.csv: task-level proof-run table.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM-LoRA"

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_id)

For best results, use the model inside a strict code-only verifier harness. Do not evaluate it only by casual chat prompts.