Instructions to use groxaxo/Qwen3.5-24.5B-Reapped-v1 with libraries and local apps.

Libraries

How to use groxaxo/Qwen3.5-24.5B-Reapped-v1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="groxaxo/Qwen3.5-24.5B-Reapped-v1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3.5-24.5B-Reapped-v1")
model = AutoModelForCausalLM.from_pretrained("groxaxo/Qwen3.5-24.5B-Reapped-v1")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use groxaxo/Qwen3.5-24.5B-Reapped-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "groxaxo/Qwen3.5-24.5B-Reapped-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "groxaxo/Qwen3.5-24.5B-Reapped-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use groxaxo/Qwen3.5-24.5B-Reapped-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "groxaxo/Qwen3.5-24.5B-Reapped-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "groxaxo/Qwen3.5-24.5B-Reapped-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "groxaxo/Qwen3.5-24.5B-Reapped-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "groxaxo/Qwen3.5-24.5B-Reapped-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use groxaxo/Qwen3.5-24.5B-Reapped-v1 with Docker Model Runner:
```
docker model run hf.co/groxaxo/Qwen3.5-24.5B-Reapped-v1
```

Qwen3.5-24.5B-Reapped-v1

A leaner, coding-sharpened Qwen3.5 MoE. This model takes a 35B-class Qwen3.5 Mixture-of-Experts, REAPs away ~30% of its experts to land at ~24.5B total parameters (≈3B active per token), then bakes in a coding/agentic LoRA so the slimmer network punches well above its memory footprint.

Smaller resident weights. Same ~3B active compute per token. A coder's attitude welded on.

Why it exists

Modern MoE models carry a lot of expert capacity you don't always need. REAP (Router-weighted Expert Activation Pruning) ranks experts by how much the router actually relies on them and drops the dead weight: here 256 → 180 experts (76 removed, about 30%) at the `seed_42-0.30` setting. The result loads in ~47 GB bf16 (fits comfortably across 3×24 GB GPUs) while keeping the active-parameter compute of the original A3B design.
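The ranking idea above can be sketched in a few lines. This is a simplified illustration, not the pruning code used for this repo: it assumes each expert is scored by the mean of (router gate weight × norm of the expert's output) over the tokens routed to it. The function names, the record format, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def reap_style_scores(routing_records):
    """Average gate_weight * output_norm per expert.

    routing_records: iterable of (expert_id, gate_weight, output_norm),
    one record per (token, selected expert) pair. Hypothetical format.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, output_norm in routing_records:
        totals[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, sorted by id."""
    ranked = sorted(scores, key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration data: 4 experts, keep 3 (the card's run keeps 180 of 256).
records = [
    (0, 0.6, 2.0), (0, 0.5, 1.8),  # relied on heavily
    (1, 0.1, 0.5),                 # low gate weight, small output
    (2, 0.4, 1.5), (2, 0.3, 1.0),
    (3, 0.3, 2.5),
]
scores = reap_style_scores(records)
kept = experts_to_keep(scores, keep=3)
print(kept)  # expert 1 has the lowest score and is dropped
```

In this sketch an expert that never appears in the records gets no score at all, so a real pipeline would need to decide how to treat experts the calibration data never routes to.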

On top of the pruned base we merged a rank-16 QLoRA trained on a coding + agentic mix, so the model ships ready to write and reason about code rather than needing a separate adapter at serve time.

Lineage

| Stage | What | Result |
|---|---|---|
| Base | Qwen3.5 MoE (A3B), "Heretic" lineage | 256 experts |
| Prune | REAP `seed_42-0.30` | 180 experts, ~24.5B total |
| Specialize | QLoRA r16 (NF4, FSDP2, 3×RTX 3090) on `coding_fable_mix` | coding/agentic adapter |
| Ship | LoRA merged into the pruned base (this repo) | standalone bf16 model |

Model details

Architecture: Qwen3_5MoeForCausalLM (qwen3_5_moe) — hybrid DeltaNet linear-attention + full-attention layers, MoE FFN with a shared expert.
Experts: 180 (REAP-pruned from 256) · Layers: 40 · Hidden: 2048
Params: ~24.5B total, ~3B active per token
Precision: bf16 · Context: long-context capable (the vLLM example below serves it with `--max-model-len 8192`; the base supports far more)
Tokenizer / chat template: inherited from the Qwen3.5 base (included)
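The parameter count and precision listed above imply a rough weights-only memory footprint. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it uses the rounded 24.5B figure, 2 bytes per bf16 parameter, and an even three-way split across GPUs, and it ignores KV cache, activations, and runtime overhead. The function name is hypothetical.

```python
def weight_memory(total_params, bytes_per_param=2):
    """Weights-only footprint as (decimal GB, binary GiB); bf16 is 2 bytes."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / 1e9, total_bytes / 2**30


gb, gib = weight_memory(24.5e9)
per_gpu_gb = gb / 3  # assumes an even split over 3 GPUs, which is only approximate

print(f"total: {gb:.1f} GB ({gib:.1f} GiB)")
print(f"per GPU with a 3-way split: {per_gpu_gb:.1f} GB")
```

The card's ~47 GB figure falls between the decimal and binary results, which is plausible given that the parameter count is only stated approximately.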

Specialization (the merged LoRA)

Adapter: LoRA r=16, α=32, dropout=0.05; targets sequence-mixing only (q/k/v/o_proj + DeltaNet in_proj_{qkv,z,b,a} + out_proj) — experts were not adapted.
Data: coding_fable_mix — 10,270 chat rows including agentic-coding traces (~20%).
Recipe: 4-bit NF4 QLoRA, FSDP2 sharded (no CPU offload), Flash-Attention-2, bf16, seq-len 2048, LR 1.2e-4 cosine, effective batch 24, on 3× RTX 3090.
Checkpoint loss: 1.33 (ppl ≈ 3.79).
Merge fidelity: verified weight-exact — for adapted modules W_merged = W_base + (α/r)·B·A (max abs error 2.4e-4, bf16 rounding); all non-adapted weights byte-identical to the base.
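Two relations in the list above can be checked with a few lines of plain Python: perplexity as the exponential of the loss, and the merge rule W_merged = W_base + (α/r)·B·A. This is a toy sketch, not the verification script used for this repo. It assumes the reported loss is a mean token cross-entropy in nats, and the helper names and the 2×2 matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merge_lora(w_base, lora_b, lora_a, alpha, r):
    """Return W_base + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_base, delta)]


def max_abs_error(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for xr, yr in zip(x, y) for p, q in zip(xr, yr))


# Toy rank-1 adapter with alpha/r = 2, the same scale as r=16, alpha=32.
w_base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[0.5], [0.25]]   # shape (out_features, r)
lora_a = [[1.0, 2.0]]      # shape (r, in_features)
merged = merge_lora(w_base, lora_b, lora_a, alpha=2, r=1)
error = max_abs_error(merged, [[2.0, 2.0], [0.5, 2.0]])

print(merged, error)
print(round(math.exp(1.33), 2))  # perplexity implied by the rounded loss
```

The rounded loss of 1.33 gives a perplexity of about 3.78, so the reported 3.79 presumably comes from the unrounded loss value. For a real check, the same comparison would be run per adapted module on the actual tensors, where bf16 rounding explains a small nonzero error.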

Usage

vLLM (recommended — tested pp=3, tp=1 on 3×24 GB)

vllm serve groxaxo/Qwen3.5-24.5B-Reapped-v1 \
  --pipeline-parallel-size 3 --tensor-parallel-size 1 \
  --dtype bfloat16 --max-model-len 8192 \
  --enforce-eager --enable-prefix-caching

Note: the qwen3_5_moe architecture (DeltaNet + MoE) needs a vLLM build with Qwen3.5-MoE support.
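Once the server is up, it can be queried from Python as well as from curl. The sketch below uses only the standard library and assumes the server listens on `http://localhost:8000` and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names `build_chat_request` and `ask`, and the `max_tokens` value, are hypothetical choices for illustration.

```python
import json
import urllib.request

MODEL_ID = "groxaxo/Qwen3.5-24.5B-Reapped-v1"


def build_chat_request(prompt, model=MODEL_ID,
                       base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does.
request = build_chat_request("Reverse the words in a string.")
print(request.full_url)
```

With the `vllm serve` command above running, `ask("Reverse the words in a string.")` would return the reply text. Pass a different `base_url` if the server was started on another host or port.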

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "groxaxo/Qwen3.5-24.5B-Reapped-v1"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python function that reverses the words in a string."}]
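# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))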

This is a reasoning-style model: it may emit a thinking trace before the final answer.
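If the thinking trace needs to be separated from the final answer, a small helper can do it. The sketch below assumes the trace is delimited by Qwen-style `<think>` and `</think>` tags in the decoded text, which this card does not state explicitly; check the actual output of your setup first. The function name `split_reasoning` and the sample string are hypothetical.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split generated text into (thinking, answer).

    Assumes Qwen-style <think>...</think> delimiters. Returns an empty
    thinking string when no closing tag is present.
    """
    if close_tag not in text:
        return "", text.strip()
    thinking, answer = text.split(close_tag, 1)
    # The opening tag may be missing if the chat template already emitted it.
    thinking = thinking.replace(open_tag, "", 1)
    return thinking.strip(), answer.strip()


sample = (
    "<think>Split on spaces, reverse, join.</think>\n\n"
    "def rev(s):\n    return ' '.join(reversed(s.split()))"
)
thinking, answer = split_reasoning(sample)
print(answer)
```

Splitting on the closing tag only means the helper still works when the opening tag was part of the prompt rather than the generated text.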

Sanity checks (served via vLLM, pp3/tp1)

| Prompt | Response |
|---|---|
| Reverse the words in a string | `' '.join(reversed(s.split()))` ✅ |
| Train 60 km in 45 min → km/h | 80 ✅ |
| Why does `lst[3]` IndexError; fix it | zero-indexed → use `lst[-1]` ✅ |
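The three responses in the table can be reproduced directly. This is a small check of the expected answers, not the prompts or harness used for the sanity run; the three-element list is an assumption about what the `lst[3]` prompt contained, and the function names are hypothetical.

```python
def reverse_words(s):
    """Reverse word order, collapsing runs of whitespace."""
    return " ".join(reversed(s.split()))


def speed_kmh(distance_km, minutes):
    """Average speed in km/h for a distance covered in the given minutes."""
    return distance_km / (minutes / 60)


lst = [10, 20, 30]  # assumed three items, so valid indices are 0..2
try:
    lst[3]          # one past the end: raises IndexError
    last = None
except IndexError:
    last = lst[-1]  # the suggested fix: index the last item from the end

print(reverse_words("hello big world"))
print(speed_kmh(60, 45))
print(last)
```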

Limitations & notes

Inherits the biases and uncensored ("Heretic"-lineage) behavior of the base.
REAP pruning removes expert capacity; expect some regression on tasks far outside the coding/agentic specialization relative to the full 256-expert model.
Only the attention/linear-attention projections were fine-tuned — knowledge stored in experts is the pruned base's.
"v1" — an early specialization checkpoint (2K-context stage). Longer-context continuations are planned.

Acknowledgements

Built on the Qwen3.5 MoE family, slimmed with the REAP expert-pruning method, and specialized with axolotl QLoRA on consumer 3×RTX 3090 hardware. Released by groxaxo.
