Qwen2.5-Coder-7B Agentic SLM v5 Merged

This repository contains the merged 7B model:

Qwen/Qwen2.5-Coder-7B-Instruct + v5 LoRA adapter.

It is the deployable dense 7B component of the v5 agentic coding system. The best measured result comes from running this model inside a deterministic verifier/rescue harness, not from raw chat usage alone.

Current Proof Gate

Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue

Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.

Phase Greedy pass@1 Coverage@K Selected@K Repair Final
Qwen2.5-Coder-7B reference harness 37/50 40/50 40/50 2/50 42/50
v5 7B adapter/merged primary 37/50 42/50 42/50 2/50 44/50
14B rescue on primary misses 1/6 3/6 3/6 1/6 4/6
v5 combined rescue system 38/50 45/50 45/50 3/50 48/50

Lift Summary

The 7B merged model alone improved the final harness score from 42/50 to 44/50.

That is:

  • +2/50 absolute tasks.
  • +4 percentage points.
  • +4.76% relative improvement over the 42/50 reference.

The full v5 rescue system improved from 42/50 to 48/50.

That is:

  • +6/50 absolute tasks.
  • +12 percentage points.
  • +14.29% relative improvement.
  • 75% failure reduction, from 8 failures to 2 failures.

Interpretation

This model should be viewed as a compact coding component, not a frontier-model replacement by itself.

The practical artifact is:

  • 7B merged model for primary code generation.
  • Deterministic verifier/test runner.
  • Candidate selection by executable tests.
  • Repair pass for failed candidates.
  • Optional rescue model for missed tasks.

The strongest result requires the harness.

Limitations

  • The current proof gate is small.
  • HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
  • No broad SWE-bench claim is made.
  • No Claude Sonnet 4.5 win is claimed.
  • Contamination risk must be handled carefully on common public coding benchmarks.

Required Next Benchmarks

Future claims should be gated by a broader eval suite:

  • LiveCodeBench, using recent and non-training-contaminated slices.
  • BigCodeBench, including realistic library/function behavior.
  • SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
  • Repo-edit tasks with hidden tests.
  • Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
  • Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
  • Abstention and invalid-output rates.
  • Robustness under strict code-only output constraints.

Batch-Based Release Discipline

The next iteration should avoid giant all-in-one notebooks.

Preferred release process:

  1. baseline: evaluate base model only.
  2. candidate: evaluate one candidate change only.
  3. failure_forge: collect failed attempts and verifier observations.
  4. repair_train: train only on verified minimal repairs.
  5. heldout_eval: rerun held-out benchmark tasks.
  6. release: push LoRA, merged model, and GGUF only after the gate passes.

Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

For meaningful results, run the model in a verifier harness rather than judging raw single responses.

Downloads last month
-
Safetensors
Model size
8B params
Tensor type
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for josephmayo/Qwen2.5-agentic-7B-SLM

Base model

Qwen/Qwen2.5-7B
Finetuned
(384)
this model
Quantizations
1 model