Cinder — Qwen3.6-35B-A3B (abliterated, NVFP4)

Cinder is the NVFP4 quantization of Ember — the abliterated (refusal-removed) build of Qwen/Qwen3.6-35B-A3B. Same surgical abliteration, 3× smaller: **22 GB** vs ~66 GB for the BF16 Ember.

For the full method writeup, retention evidence, and the BF16 weights, see Ember. The patch + method: heretic-fused-moe-abliteration.

Not affiliated with NVIDIA or the Apache Software Foundation. Independent community model.

What it is

  • Format: NVFP4 via compressed-tensors / llm-compressor. FP4 weights with FP8 block scales, NVFP4 activation scheme.
  • Hardware: needs an NVIDIA Blackwell GPU (sm_120 / sm_121 — e.g. RTX 50-series, DGX Spark / GB10) and a recent vLLM with NVFP4 support. It will not run on older GPUs. If you're on anything pre-Blackwell, use Ember (BF16) and quantize to your own format.
  • ~22 GB on disk — fits comfortably in the DGX Spark's unified memory with room for a long context and a speculative drafter.

Quantization details (and what was deliberately not quantized)

The fused MoE experts are FP4-packed; the hybrid layers are preserved in BF16. Verified post-quant:

  • 30,720 expert weight tensors FP4-packed, 0 experts silently left in BF16 (the fused-expert handling carried through quantization).
  • The 30 linear-attention (Mamba/GDN) layers stayed BF16 — quantizing them breaks the model; they're in the ignore list (linear_attn, mlp.gate, shared_expert_gate, embed_tokens, lm_head, vision tower).
  • Quant scales clean, no NaNs.

Quant recipe ships in recipe.yaml.

Usage (vLLM, Blackwell)

vllm serve <path-to-cinder> \
  --quantization compressed-tensors \
  --max-model-len 131072 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code
  • Vision-language (image-text-to-text) — image input works; vision tower is BF16, untouched by quant.
  • Thinking via chat_template_kwargs: {"enable_thinking": false} per request.
  • Pairs with the public z-lab DFlash drafter for ~1.5× decode speedup via speculative decoding (not included).

Safety

Refusal behavior is removed (same as Ember). You own the guardrails. Research / red-team / operator-controlled use.

License & attribution

  • License: Apache 2.0 (inherited from base). See LICENSE / NOTICE. Modified from Qwen3.6-35B-A3B (abliteration + NVFP4 quantization).
  • Base: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
  • Abliteration: built on Heretic (Philipp Emanuel Weidmann) + a fused-MoE patch (see Ember).
  • Quantization: llm-compressor (NVFP4).

The smaller, hardier cousin of Ember — forged by Sparky on a DGX Spark. A cinder: what's left when the ember has done its work, and it still burns. 🔥

Downloads last month
58
Safetensors
Model size
21B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SparkyForge/Cinder

Quantized
(526)
this model