Qwen3.5-24.5B-Reapped-v1

A leaner, coding-sharpened Qwen3.5 MoE. This model takes a 35B-class Qwen3.5 Mixture-of-Experts, uses REAP to prune ~30% of its experts and land at ~24.5B total parameters (≈3B active per token), then merges in a coding/agentic LoRA so the slimmer network is specialized for code at a smaller memory footprint.

Smaller resident weights. Same ~3B active compute per token. A coder's attitude welded on.


Why it exists

Modern MoE models carry a lot of expert capacity you don't always need. REAP (Router-weighted Expert Activation Pruning) ranks experts by how much the router actually relies on them and drops the dead weight — here 256 → 180 experts at the seed_42 / 0.30 setting. The result loads in ~47 GB bf16 (fits comfortably across 3×24 GB GPUs) while keeping the active-parameter compute of the original A3B design.
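As a rough illustration of the ranking step, the sketch below scores each expert by the mean of (router gate weight × expert output norm) over the tokens routed to it, then keeps the 180 highest-scoring experts out of 256. The scoring rule is a simplified reading of the REAP criterion, the function names are hypothetical, and the routing records are synthetic; this is not the pruning script used for this model.

```python
import random

def reap_scores(records, num_experts):
    """Average gate * output-norm per expert over the tokens routed to it.

    records: iterable of (expert_id, gate_weight, output_norm) tuples,
    one per (token, selected expert) pair from a calibration run.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate, norm in records:
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    # Experts that never fired score 0.0 and are pruned first.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def select_experts(scores, keep):
    """Return the sorted indices of the `keep` highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Synthetic calibration data (assumption: real runs log these from the model).
rng = random.Random(42)
NUM_EXPERTS, KEEP = 256, 180
records = [
    (rng.randrange(NUM_EXPERTS), rng.random(), rng.uniform(0.5, 2.0))
    for _ in range(20000)
]
kept = select_experts(reap_scores(records, NUM_EXPERTS), KEEP)
print(len(kept), f"{1 - KEEP / NUM_EXPERTS:.1%} pruned")
```

Keeping 180 of 256 experts removes 29.7% of them, which is the "~30%" quoted for the 0.30 setting.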

On top of the pruned base we merged a rank-16 QLoRA trained on a coding + agentic mix, so the model ships ready to write and reason about code rather than needing a separate adapter at serve time.

Lineage

Stage | What | Result
--- | --- | ---
Base | Qwen3.5 MoE (A3B), "Heretic" lineage | 256 experts
Prune | REAP seed_42-0.30 | 180 experts, ~24.5B total
Specialize | QLoRA r16 (NF4, FSDP2, 3×RTX 3090) on coding_fable_mix | coding/agentic adapter
Ship | LoRA merged into the pruned base (this repo) | standalone bf16 model

Model details

  • Architecture: Qwen3_5MoeForCausalLM (qwen3_5_moe) — hybrid DeltaNet linear-attention + full-attention layers, MoE FFN with a shared expert.
  • Experts: 180 (REAP-pruned from 256) · Layers: 40 · Hidden: 2048
  • Params: ~24.5B total, ~3B active per token
  • Precision: bf16 · Context: long-context capable (served at 8k here; base supports far more)
  • Tokenizer / chat template: inherited from the Qwen3.5 base (included)
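For capacity planning, the weight footprint follows directly from the parameter count and precision. The sketch below is back-of-the-envelope arithmetic only: it assumes exactly 24.5B parameters at 2 bytes each (bf16) and an even three-way split, and it ignores KV cache, activations, and framework overhead.

```python
def weight_bytes(total_params, bytes_per_param=2):
    """Raw weight size in bytes; bf16 stores 2 bytes per parameter."""
    return total_params * bytes_per_param

TOTAL_PARAMS = 24.5e9  # rounded figure from the list above
NUM_GPUS = 3           # assumption: weights split evenly across GPUs

size = weight_bytes(TOTAL_PARAMS)
print(f"{size / 1e9:.1f} GB, {size / 2**30:.1f} GiB")
print(f"{size / 2**30 / NUM_GPUS:.1f} GiB of weights per GPU")
```

With the rounded 24.5B figure this gives about 49 GB (roughly 45.6 GiB), in the same range as the ~47 GB load size quoted earlier, and leaves headroom on each 24 GB card for cache and activations.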

Specialization (the merged LoRA)

  • Adapter: LoRA r=16, α=32, dropout=0.05; targets sequence-mixing only (q/k/v/o_proj + DeltaNet in_proj_{qkv,z,b,a} + out_proj) — experts were not adapted.
  • Data: coding_fable_mix — 10,270 chat rows including agentic-coding traces (~20%).
  • Recipe: 4-bit NF4 QLoRA, FSDP2 sharded (no CPU offload), Flash-Attention-2, bf16, seq-len 2048, LR 1.2e-4 cosine, effective batch 24, on 3× RTX 3090.
  • Checkpoint loss: 1.33 (ppl ≈ 3.79).
  • Merge fidelity: verified weight-exact — for adapted modules W_merged = W_base + (α/r)·B·A (max abs error 2.4e-4, bf16 rounding); all non-adapted weights byte-identical to the base.
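The merge identity above can be checked on a toy example. This is a minimal sketch with hand-picked rank-1 matrices in pure Python, not the verification script used for the release; the helper names are hypothetical. The only value taken from this card is the scale α/r = 32/16 = 2.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w_base, lora_a, lora_b, scale):
    """Return W_base + scale * (B @ A), with B: (out x r) and A: (r x in)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]

def max_abs_error(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for xr, yr in zip(x, y) for p, q in zip(xr, yr))

scale = 32 / 16                      # alpha / r from the adapter config
w_base = [[1.0, 0.0], [0.0, 1.0]]    # toy 2x2 base weight
lora_b = [[1.0], [2.0]]              # toy B, rank 1
lora_a = [[0.5, -0.5]]               # toy A, rank 1

merged = merge_lora(w_base, lora_a, lora_b, scale)
expected = [[2.0, -1.0], [2.0, -1.0]]
print(merged, max_abs_error(merged, expected))
```

On real checkpoints the same comparison would run per adapted module on bf16 tensors, where a small nonzero error from rounding is expected.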

Usage

vLLM (recommended — tested pp=3, tp=1 on 3×24 GB)

vllm serve groxaxo/Qwen3.5-24.5B-Reapped-v1 \
  --pipeline-parallel-size 3 --tensor-parallel-size 1 \
  --dtype bfloat16 --max-model-len 8192 \
  --enforce-eager --enable-prefix-caching

Note: the qwen3_5_moe architecture (DeltaNet + MoE) needs a vLLM build with Qwen3.5-MoE support.
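Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the default address http://localhost:8000 and that the served model name equals the repo id; adjust both if the server was started with a different host, port, or served name. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "groxaxo/Qwen3.5-24.5B-Reapped-v1"

def build_chat_request(prompt, max_tokens=256, temperature=0.2):
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt):
    """POST the prompt to the running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Requires the vllm serve command above to be running:
# print(chat("Write a Python function that reverses the words in a string."))
```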

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "groxaxo/Qwen3.5-24.5B-Reapped-v1"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16, device_map="auto")

msgs = [{"role": "user", "content": "Write a Python function that reverses the words in a string."}]
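# return_dict=True yields input_ids plus attention_mask regardless of library defaults
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
# decode only the newly generated tokens
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))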

This is a reasoning-style model: it may emit a thinking trace before the final answer.
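If only the final answer is needed, the thinking trace can be split off after decoding. This is a minimal sketch that assumes the trace is delimited by `<think>` and `</think>` tags, as in earlier Qwen reasoning models; that delimiter is an assumption here and should be checked against the chat template shipped with this repo. The function name is hypothetical.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Handles output where the opening tag is missing because the chat
    template already emitted it as part of the prompt.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()

reasoning, answer = split_reasoning(
    "<think>Split on whitespace, reverse, join.</think>\n' '.join(reversed(s.split()))"
)
print(answer)
```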

Sanity checks (served via vLLM, pp3/tp1)

Prompt | Response
--- | ---
Reverse the words in a string | ' '.join(reversed(s.split()))
Train 60 km in 45 min → km/h | 80
Why does lst[3] IndexError; fix it | zero-indexed → use lst[-1]
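The three responses can be checked directly. The sketch below assumes a three-element list for the last prompt, which is what makes index 3 out of range; the helper names are illustrative.

```python
def reverse_words(s):
    """Reverse word order using the expression the model returned."""
    return " ".join(reversed(s.split()))

def speed_kmh(distance_km, minutes):
    """Average speed in km/h for a distance covered in the given minutes."""
    return distance_km / (minutes / 60)

lst = [10, 20, 30]  # assumption: three elements, so valid indices are 0..2

print(reverse_words("the quick brown fox"))
print(speed_kmh(60, 45))
print(lst[-1])  # last element, avoiding the out-of-range lst[3]
```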

Limitations & notes

  • Inherits the biases and uncensored ("Heretic"-lineage) behavior of the base.
  • REAP pruning removes expert capacity; expect some regression on tasks far outside the coding/agentic specialization relative to the full 256-expert model.
  • Only the attention/linear-attention projections were fine-tuned — knowledge stored in experts is the pruned base's.
  • "v1" — an early specialization checkpoint (2K-context stage). Longer-context continuations are planned.

Acknowledgements

Built on the Qwen3.5 MoE family, slimmed with the REAP expert-pruning method, and specialized with axolotl QLoRA on consumer 3×RTX 3090 hardware. Released by groxaxo.
