
Libraries

How to use josephmayo/Qwen2.5-agentic-7B-SLM with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="josephmayo/Qwen2.5-agentic-7B-SLM")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("josephmayo/Qwen2.5-agentic-7B-SLM")
model = AutoModelForCausalLM.from_pretrained("josephmayo/Qwen2.5-agentic-7B-SLM")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use josephmayo/Qwen2.5-agentic-7B-SLM with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "josephmayo/Qwen2.5-agentic-7B-SLM"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "josephmayo/Qwen2.5-agentic-7B-SLM",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
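
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout_s=60.0):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server, so it is left commented out):
# print(chat("http://localhost:8000", "josephmayo/Qwen2.5-agentic-7B-SLM",
#            "What is the capital of France?"))
```

Only `chat` touches the network; `build_chat_request` can be inspected offline to confirm the URL and JSON body match the curl call.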

Use Docker

docker model run hf.co/josephmayo/Qwen2.5-agentic-7B-SLM

SGLang

How to use josephmayo/Qwen2.5-agentic-7B-SLM with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "josephmayo/Qwen2.5-agentic-7B-SLM" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "josephmayo/Qwen2.5-agentic-7B-SLM",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "josephmayo/Qwen2.5-agentic-7B-SLM" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use josephmayo/Qwen2.5-agentic-7B-SLM with Docker Model Runner:
```
docker model run hf.co/josephmayo/Qwen2.5-agentic-7B-SLM
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Qwen2.5-Coder-7B Agentic SLM v5 Merged

This repository contains the merged 7B model: Qwen/Qwen2.5-Coder-7B-Instruct with the v5 LoRA adapter merged into its weights.

It is the deployable dense 7B component of the v5 agentic coding system. The best measured result comes from running this model inside a deterministic verifier/rescue harness, not from raw chat usage alone.

Current Proof Gate

Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue

Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.

| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter/merged primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |
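
The column definitions are not spelled out here. One reading that fits every row, offered as an assumption rather than the author's stated definition, is that Final equals Selected@K plus Repair, and that the combined row is the primary row plus the rescue row. A short check of that reading:

```python
# Counts copied from the table: (greedy, coverage, selected, repair, final, total).
rows = {
    "reference": (37, 40, 40, 2, 42, 50),
    "primary": (37, 42, 42, 2, 44, 50),
    "rescue": (1, 3, 3, 1, 4, 6),
    "combined": (38, 45, 45, 3, 48, 50),
}


def final_matches(row):
    """Assumed relation: final = selected + repair."""
    _, _, selected, repair, final, _ = row
    return selected + repair == final


all_rows_consistent = all(final_matches(row) for row in rows.values())

# The rescue model is only run on tasks the primary model failed.
primary_misses = rows["primary"][5] - rows["primary"][4]

# Assumed relation: combined counts = primary counts + rescue counts.
combined_from_parts = tuple(
    p + r for p, r in zip(rows["primary"][:5], rows["rescue"][:5])
)

print(all_rows_consistent)   # True
print(primary_misses)        # 6
print(combined_from_parts)   # (38, 45, 45, 3, 48)
```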

Lift Summary

With no rescue model, the 7B merged model improved the final harness score from 42/50 to 44/50.

That is:

+2 tasks out of 50 (absolute).
+4 percentage points (84% to 88%).
+4.76% relative improvement over the 42/50 reference.

The full v5 rescue system improved the final score from 42/50 to 48/50.

That is:

+6 tasks out of 50 (absolute).
+12 percentage points (84% to 96%).
+14.29% relative improvement over the 42/50 reference.
A 75% reduction in failures, from 8 to 2.
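
The figures above follow from simple arithmetic on the pass counts. A small sketch that reproduces them (the function name `lift` is illustrative):

```python
def lift(reference_pass, new_pass, total):
    """Compare two pass counts measured on the same task set."""
    gained = new_pass - reference_pass
    reference_failures = total - reference_pass
    new_failures = total - new_pass
    return {
        "tasks_gained": gained,
        "percentage_points": round(100.0 * gained / total, 2),
        "relative_pct": round(100.0 * gained / reference_pass, 2),
        "failure_reduction_pct": round(
            100.0 * (reference_failures - new_failures) / reference_failures, 2
        ),
    }


primary = lift(42, 44, 50)
combined = lift(42, 48, 50)
print(primary["percentage_points"], primary["relative_pct"])    # 4.0 4.76
print(combined["percentage_points"], combined["relative_pct"])  # 12.0 14.29
print(combined["failure_reduction_pct"])                        # 75.0
```

Relative improvement is measured against the 42 passing tasks, while failure reduction is measured against the 8 failing ones, which is why the two percentages differ.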

Interpretation

This model should be viewed as a compact coding component, not a frontier-model replacement by itself.

The practical artifact is:

7B merged model for primary code generation.
Deterministic verifier/test runner.
Candidate selection by executable tests.
Repair pass for failed candidates.
Optional rescue model for missed tasks.

The strongest result requires the harness.
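
The components listed above can be wired together roughly as follows. This is a simplified sketch and not the author's harness: `solve_task` and the stub callables are hypothetical names, the verifier is a stand-in for running real tests, and the order of stages is an assumption based on the list.

```python
from typing import Callable, List, Optional, Tuple

Generate = Callable[[str], List[str]]   # prompt -> candidate programs
Verify = Callable[[str], bool]          # candidate -> passes the tests?
Repair = Callable[[str, str], str]      # (prompt, failed candidate) -> new candidate


def solve_task(
    prompt: str,
    generate: Generate,
    verify: Verify,
    repair: Repair,
    rescue: Optional[Generate] = None,
) -> Tuple[Optional[str], str]:
    """Return (solution, stage) where stage names the step that succeeded."""
    candidates = generate(prompt)

    # Candidate selection by executable tests.
    for candidate in candidates:
        if verify(candidate):
            return candidate, "selected"

    # Repair pass for failed candidates.
    for candidate in candidates:
        repaired = repair(prompt, candidate)
        if verify(repaired):
            return repaired, "repaired"

    # Optional rescue model for missed tasks.
    if rescue is not None:
        for candidate in rescue(prompt):
            if verify(candidate):
                return candidate, "rescued"

    return None, "failed"


# Toy stubs standing in for the models and the test runner.
def primary_model(prompt):
    return ["def add(a, b):\n    return a - b"]

def toy_verifier(code):
    return code.strip().endswith("return a + b")

def toy_repair(prompt, failed):
    return failed.replace("a - b", "a + b")

result = solve_task("add two numbers", primary_model, toy_verifier, toy_repair)
print(result[1])  # repaired
```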

Limitations

The current proof gate is small: at 50 tasks, a single task shifts the score by 2 percentage points.
HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
No broad SWE-bench claim is made.
No claim is made of outperforming Claude Sonnet 4.5.
Contamination risk must be handled carefully on common public coding benchmarks.

Required Next Benchmarks

Future claims should be gated by a broader eval suite:

LiveCodeBench, using recent and non-training-contaminated slices.
BigCodeBench, including realistic library/function behavior.
SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
Repo-edit tasks with hidden tests.
Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
Abstention and invalid-output rates.
Robustness under strict code-only output constraints.
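
The cost, latency, abstention, and invalid-output items above could be tracked per task and aggregated along these lines. The record fields, the `summarize` helper, and the GPU price are illustrative assumptions, and the two records are made-up inputs, not measured results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    passed: bool
    wall_seconds: float
    tokens: int
    repair_count: int
    invalid_output: bool = False
    abstained: bool = False


def summarize(records: List[TaskRecord], usd_per_gpu_hour: float) -> Dict[str, float]:
    """Aggregate per-task records into run-level cost and quality metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    passes = sum(r.passed for r in records)
    total_seconds = sum(r.wall_seconds for r in records)
    cost_usd = total_seconds / 3600.0 * usd_per_gpu_hour
    return {
        "pass_rate": passes / n,
        "total_wall_seconds": total_seconds,
        "tokens_per_task": sum(r.tokens for r in records) / n,
        "repairs_per_task": sum(r.repair_count for r in records) / n,
        "invalid_output_rate": sum(r.invalid_output for r in records) / n,
        "abstention_rate": sum(r.abstained for r in records) / n,
        "cost_usd": cost_usd,
        "success_per_dollar": passes / cost_usd if cost_usd > 0 else float("inf"),
    }


# Made-up example: two tasks, 30 minutes each, on a hypothetical $2/hour GPU.
records = [
    TaskRecord(passed=True, wall_seconds=1800.0, tokens=1000, repair_count=0),
    TaskRecord(passed=False, wall_seconds=1800.0, tokens=3000, repair_count=2,
               invalid_output=True),
]
summary = summarize(records, usd_per_gpu_hour=2.0)
print(summary["cost_usd"], summary["success_per_dollar"])  # 2.0 0.5
```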

Batch-Based Release Discipline

The next iteration should avoid giant all-in-one notebooks.

Preferred release process:

baseline: evaluate base model only.
candidate: evaluate one candidate change only.
failure_forge: collect failed attempts and verifier observations.
repair_train: train only on verified minimal repairs.
heldout_eval: rerun held-out benchmark tasks.
release: push LoRA, merged model, and GGUF only after the gate passes.

Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.
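
As an illustration of "explicit pass/fail criteria" and "deterministic output files", a gate check might look like the sketch below. The threshold `min_gain_tasks` and the function names are hypothetical; the card does not specify the actual criteria.

```python
import json

# Stage names taken from the release process above.
STAGES = ["baseline", "candidate", "failure_forge", "repair_train",
          "heldout_eval", "release"]


def evaluate_gate(baseline_pass, candidate_pass, total, min_gain_tasks):
    """Pass the gate only if the candidate beats the baseline by the required margin."""
    gain = candidate_pass - baseline_pass
    return {
        "baseline_pass": baseline_pass,
        "candidate_pass": candidate_pass,
        "total": total,
        "gain": gain,
        "min_gain_tasks": min_gain_tasks,
        "passed": gain >= min_gain_tasks,
    }


def to_report(result):
    """Serialize with sorted keys so identical results give identical files."""
    return json.dumps(result, sort_keys=True, indent=2)


result = evaluate_gate(baseline_pass=42, candidate_pass=44, total=50, min_gain_tasks=1)
report = to_report(result)
print(result["passed"])  # True
```

Writing `report` to a file at the end of each batch notebook gives a fixed artifact that the next stage can read before deciding whether to run.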

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "josephmayo/Qwen2.5-agentic-7B-SLM"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

For meaningful results, run the model in a verifier harness rather than judging raw single responses.
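
A minimal sketch of the verification step, assuming each task comes with assertion-style tests. The helper `passes_tests` is a hypothetical name and not part of this repository. It runs the candidate in a separate Python process with a timeout, which is not a security sandbox; untrusted model output should be executed in an isolated environment.

```python
import subprocess
import sys


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate followed by its tests exits with status 0."""
    program = candidate_code + "\n\n" + test_code + "\n"
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat hangs and infinite loops as failures.
        return False
    return completed.returncode == 0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

print(passes_tests(good, tests))  # True
print(passes_tests(bad, tests))   # False
```

A failing assertion raises `AssertionError` in the child process, which exits with a non-zero status, so the return code alone is enough to decide pass or fail.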


Safetensors

Model size: 8B params
Tensor type: F16

Model tree for josephmayo/Qwen2.5-agentic-7B-SLM

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-Coder-7B
Finetuned: Qwen/Qwen2.5-Coder-7B-Instruct
Finetuned (384): this model
Quantizations: 1 model