Qwen3.5-24.5B-Reapped-v1
A leaner, coding-sharpened Qwen3.5 MoE. This model takes a 35B-class Qwen3.5 Mixture-of-Experts, REAPs away ~30% of its experts to land at ~24.5B total parameters (≈3B active per token), then bakes in a coding/agentic LoRA so the slimmer network punches well above its memory footprint.
Smaller resident weights. Same ~3B active compute per token. A coder's attitude welded on.
Why it exists
Modern MoE models carry a lot of expert capacity you don't always need. REAP (Router-weighted
Expert Activation Pruning) ranks experts by how much the router actually relies on them and drops the
dead weight — here 256 → 180 experts at the seed_42 / 0.30 setting. The result loads in ~47 GB
bf16 (fits comfortably across 3×24 GB GPUs) while keeping the active-parameter compute of the
original A3B design.
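The ranking idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the REAP reference implementation: the score used here (router gate weight times expert-output norm, averaged over the tokens routed to each expert), the function names `expert_saliency` and `surviving_experts`, the toy `top_k=8` routing, and the random data are all assumptions made for the example. Flooring the number of pruned experts is also an assumption; it happens to reproduce the 256 → 180 count at a 0.30 ratio.

```python
import random


def expert_saliency(gates, out_norms):
    """Average router-weighted activation per expert.

    gates:     rows of per-expert router weights (0.0 where not selected).
    out_norms: rows of per-expert output norms for the same tokens.
    """
    n_experts = len(gates[0])
    totals = [0.0] * n_experts
    counts = [0] * n_experts
    for g_row, n_row in zip(gates, out_norms):
        for j, (g, n) in enumerate(zip(g_row, n_row)):
            if g > 0.0:  # only tokens actually routed to expert j count
                totals[j] += g * n
                counts[j] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def surviving_experts(saliency, prune_ratio):
    """Drop the lowest-scoring experts; return the kept indices, sorted."""
    n_prune = int(len(saliency) * prune_ratio)  # floor (assumption): 256 * 0.30 -> 76
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(ranked[n_prune:])


# Toy calibration data (hypothetical): 512 tokens, 256 experts, 8 routed per token.
random.seed(42)
n_tokens, n_experts, top_k = 512, 256, 8
gates, out_norms = [], []
for _ in range(n_tokens):
    chosen = random.sample(range(n_experts), top_k)
    raw = [random.uniform(0.1, 1.0) for _ in chosen]
    row = [0.0] * n_experts
    for j, w in zip(chosen, raw):
        row[j] = w / sum(raw)  # normalise the selected gate weights
    gates.append(row)
    out_norms.append([random.uniform(0.5, 2.0) for _ in range(n_experts)])

kept = surviving_experts(expert_saliency(gates, out_norms), prune_ratio=0.30)
print(len(kept))  # 180
```

In a real run the gate weights and output norms would come from forward passes over a calibration set rather than from a random generator.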
On top of the pruned base we merged a rank-16 QLoRA trained on a coding + agentic mix, so the model ships ready to write and reason about code rather than needing a separate adapter at serve time.
Lineage
| Stage | What | Result |
|---|---|---|
| Base | Qwen3.5 MoE (A3B), "Heretic" lineage | 256 experts |
| Prune | REAP `seed_42-0.30` | 180 experts, ~24.5B total |
| Specialize | QLoRA r16 (NF4, FSDP2, 3×RTX 3090) on `coding_fable_mix` | coding/agentic adapter |
| Ship | LoRA merged into the pruned base (this repo) | standalone bf16 model |
Model details
- Architecture: `Qwen3_5MoeForCausalLM` (`qwen3_5_moe`), a hybrid of DeltaNet linear-attention and full-attention layers, with an MoE FFN that includes a shared expert.
- Experts: 180 (REAP-pruned from 256) · Layers: 40 · Hidden: 2048
- Params: ~24.5B total, ~3B active per token
- Precision: bf16 · Context: long-context capable (served at 8k here; base supports far more)
- Tokenizer / chat template: inherited from the Qwen3.5 base (included)
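As a rough cross-check of the memory footprint, the weight size follows from the parameter count and the 2 bytes per parameter that bf16 uses. The sketch below is back-of-the-envelope arithmetic only: it takes the nominal 24.5B figure at face value, ignores KV cache and activations, and the helper name `weight_memory_gb` is made up for the example.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Resident weight size in decimal gigabytes (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


total_gb = weight_memory_gb(24.5e9)
total_gib = 24.5e9 * 2 / 2**30
per_gpu_gb = total_gb / 3  # even split across three cards (assumption)

print(f"{total_gb:.1f} GB")           # 49.0 GB
print(f"{total_gib:.1f} GiB")         # 45.6 GiB
print(f"{per_gpu_gb:.1f} GB per GPU")
```

The nominal count gives about 49 GB (45.6 GiB), in the same range as the ~47 GB load size quoted above; the exact figure depends on the true parameter count. Split three ways, that leaves several GB per 24 GB card for KV cache and activations.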
Specialization (the merged LoRA)
- Adapter: LoRA r=16, α=32, dropout=0.05; targets sequence-mixing only (`q/k/v/o_proj` + DeltaNet `in_proj_{qkv,z,b,a}` + `out_proj`). Experts were not adapted.
- Data: `coding_fable_mix`, 10,270 chat rows including agentic-coding traces (~20%).
- Recipe: 4-bit NF4 QLoRA, FSDP2 sharded (no CPU offload), Flash-Attention-2, bf16, seq-len 2048, LR 1.2e-4 cosine, effective batch 24, on 3× RTX 3090.
- Checkpoint loss: 1.33 (ppl ≈ 3.79).
- Merge fidelity: verified weight-exact: for adapted modules `W_merged = W_base + (α/r)·B·A` (max abs error 2.4e-4, bf16 rounding); all non-adapted weights byte-identical to the base.
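The merge formula in the last bullet can be illustrated on a toy matrix. This is a minimal pure-Python sketch, not the script used to produce this repo: `matmul`, `merge_lora`, and `max_abs_error` are hypothetical helpers, and the toy adapter uses rank 2 with α=4 so that the scale α/r = 2 matches the r=16, α=32 setting.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w_base, lora_a, lora_b, alpha, rank):
    """W_merged = W_base + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]


def max_abs_error(m1, m2):
    """Largest element-wise difference between two matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Toy shapes: out=2, in=3, rank=2 (values chosen for easy hand-checking).
w_base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_b = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.25]]

w_merged = merge_lora(w_base, lora_a, lora_b, alpha=4, rank=2)
print(w_merged)  # [[2.0, 0.0, 0.0], [0.0, 1.0, 0.5]]
```

A fidelity check of the kind described would compare a recomputed `W_base + (α/r)·B·A` against the shipped tensor with something like `max_abs_error`, and expect a small nonzero value once both sides are rounded to bf16.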
Usage
vLLM (recommended — tested pp=3, tp=1 on 3×24 GB)
vllm serve groxaxo/Qwen3.5-24.5B-Reapped-v1 \
--pipeline-parallel-size 3 --tensor-parallel-size 1 \
--dtype bfloat16 --max-model-len 8192 \
--enforce-eager --enable-prefix-caching
Note: the `qwen3_5_moe` architecture (DeltaNet + MoE) needs a vLLM build with Qwen3.5-MoE support.
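Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default port 8000 on localhost; the helper names `build_chat_request` and `chat`, and the `max_tokens` and `temperature` values, are choices made for this example.

```python
import json
import urllib.request

MODEL = "groxaxo/Qwen3.5-24.5B-Reapped-v1"


def build_chat_request(prompt, max_tokens=256, temperature=0.2):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, base_url="http://localhost:8000"):
    """Send one user message and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Write a Python function that reverses the words in a string."))
```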
Transformers
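import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
mid = "groxaxo/Qwen3.5-24.5B-Reapped-v1"
tok = AutoTokenizer.from_pretrained(mid)
# device_map="auto" places the bf16 weights across the visible GPUs.
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16, device_map="auto")
msgs = [{"role": "user", "content": "Write a Python function that reverses the words in a string."}]
# return_dict=True also returns the attention mask alongside input_ids.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled.
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.2)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))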
This is a reasoning-style model: it may emit a thinking trace before the final answer.
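If only the final answer is wanted, the trace can be split off after decoding. The sketch below assumes the model follows the Qwen3-style convention of closing its reasoning with a `</think>` tag (the opening `<think>` may already be part of the prompt and so be absent from the output); that convention is not confirmed for this checkpoint, and `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text, close_tag="</think>"):
    """Return (reasoning, answer); reasoning is empty if no closing tag is found."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>split, reverse, join</think>\n\nUse ' '.join(reversed(s.split())).")
print(answer)  # Use ' '.join(reversed(s.split())).
```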
Sanity checks (served via vLLM, pp3/tp1)
| Prompt | Response |
|---|---|
| Reverse the words in a string | ' '.join(reversed(s.split())) ✅ |
| Train 60 km in 45 min → km/h | 80 ✅ |
| Why does lst[3] IndexError; fix it | zero-indexed → use lst[-1] ✅ |
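The three responses in the table are easy to confirm by hand. The snippet below is a small check written for this card, not model output; the three-element `lst` is an assumption, since the table does not show the list used in the prompt.

```python
def reverse_words(s):
    """Reverse word order using the one-liner from the first row."""
    return " ".join(reversed(s.split()))


def speed_kmh(distance_km, minutes):
    """Average speed: distance divided by time in hours."""
    return distance_km / (minutes / 60)


print(reverse_words("hello big world"))  # world big hello
print(speed_kmh(60, 45))                 # 80.0

lst = [10, 20, 30]  # hypothetical three-element list
try:
    lst[3]  # valid indices are 0..2, so this raises IndexError
except IndexError:
    print(lst[-1])  # 30
```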
Limitations & notes
- Inherits the biases and uncensored ("Heretic"-lineage) behavior of the base.
- REAP pruning removes expert capacity; expect some regression on tasks far outside the coding/agentic specialization relative to the full 256-expert model.
- Only the attention/linear-attention projections were fine-tuned — knowledge stored in experts is the pruned base's.
- "v1" — an early specialization checkpoint (2K-context stage). Longer-context continuations are planned.
Acknowledgements
Built on the Qwen3.5 MoE family, slimmed with the REAP expert-pruning method, and specialized with axolotl QLoRA on consumer 3×RTX 3090 hardware. Released by groxaxo.