Qwen3.6-35B-A3B-W4A16

World's smallest near-lossless checkpoint for Qwen3.6-35B-A3B. Fits on a single 48 GB GPU with full 1M-token context.

~28–30 GB on disk. BF16 baseline is ~70 GB. This is the first W4A16 checkpoint for Qwen/Qwen3.6-35B-A3B published to HuggingFace.

Qwen3.6-35B-A3B is a vision-language model (image + video + text β†’ text). Not a uniform INT4 squash β€” a hybrid mixed-precision recipe where every precision assignment was made for a documented reason. See Mixed-Precision Design below.

Vision calibration note: Calibration corpus is text-only. The vision encoder receives INT4 quantization with text-derived calibration signal only. Text quality targets are fully met; vision inference quality may be reduced relative to text. For vision-critical workloads, consider the W8A16 variant.


At a Glance

Property Value
Base model Qwen/Qwen3.6-35B-A3B
Architecture Hybrid Gated-DeltaNet + Sparse MoE
Layers 40 total (10 full-attention + 30 Gated DeltaNet)
MoE config 256 experts / layer, 8 routed + 1 shared active
Quant format compressed-tensors (native vLLM)
Attention layers W8A16 INT8 weights, BF16 activations
Routed experts W4A16-G32-sym (Marlin fast path on Ampere+)
Super-experts W8A16 (outlier protection)
Shared expert BF16
Boundary + DeltaNet layers BF16
Rotation None (SpinQuant incompatible with Gated-DeltaNet block norms)
KV cache dtype FP8 (recommended)
Max context 1,048,576 tokens
Disk size ~28–30 GB

Memory Footprint

Component 262k context 1M context
Model weights ~28–30 GB ~28–30 GB
FP8 KV cache (16 seqs) ~2.1 GB ~8.0 GB
FP8 KV cache (2 seqs) ~0.3 GB ~1.0 GB
Total (16 seqs @ 262k) ~30–32 GB β€”
Total (2 seqs @ 1M) β€” ~29–31 GB

KV cache only materializes for the 10 full-attention layers. The 30 Gated DeltaNet layers maintain recurrent state, not a KV cache. This is why a 1M-context window costs ~5 GB FP8 KV β€” not ~50 GB.

Both configurations fit on a single RTX A6000 (48 GB) or A100-40 with margin.


Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format β€” vLLM detects and loads quantization automatically. No --quantization flag needed.

Serve at 262k context (high throughput)

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W4A16 \
  --served-model-name qwen35 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --generation-config vllm

Serve at 1M context (long-document / agentic)

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen3.6-35B-A3B-W4A16 \
  --served-model-name qwen35 \
  --kv-cache-dtype fp8 \
  --max-model-len 1048576 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 131072 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --hf-overrides '{"rope_scaling": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

Requires vLLM β‰₯ v0.21.0. The compressed-tensors format is loaded natively β€” no extra plugins needed.

Recommended Sampling Parameters

Mode Temperature Top-P Top-K Min-P Use When
Thinking (default) 0.6 0.95 20 0.0 Reasoning, math, code
Non-thinking 0.7 0.8 20 0.0 Chat, creative, fast response

Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.


Mixed-Precision Design

Uniform W4A16 across all layers is the naive approach. It loses ~1.5% on long-context benchmarks because attention layers produce extreme activation outliers that INT4 cannot represent accurately. This model uses a per-module strategy instead.

Why not uniform W4A16?

Three failure modes in uniform W4 for this architecture:

  1. Attention outliers. Full-attention Q/K/V/O projections produce per-token activation spikes that saturate INT4 dynamic range. W8A16 (INT8 weights, BF16 activations) absorbs these outliers cleanly with near-zero quality cost. The weight size overhead vs pure W4 on the 10 attention layers is negligible (~1 GB).

  2. Super-expert collapse. Sparse MoE models have a heavy-tailed expert activation distribution. A small number of experts (top ~0.05% by activation magnitude) carry disproportionate load. Quantizing these to INT4 causes catastrophic accuracy collapse on tasks that route through them β€” a finding documented for 256-expert Qwen3-class MoEs. These super-experts are identified via a single calibration forward pass and promoted to W8A16.

  3. Boundary layer sensitivity. The first two and last decoder layer consistently produce the largest weight outliers (EAQuant / MoPEQ finding). Quantizing them degrades all downstream layers. They are held at BF16.

Precision assignment summary

Module class Precision Reason
q_proj, k_proj, v_proj, o_proj (full-attn) W8A16 INT8 weights, BF16 activations Activation outlier safety
Routed expert gate_proj, up_proj, down_proj W4A16-G32-sym Marlin fast path; largest param count
Super-experts (top 0.05% by activation magnitude) W8A16 Outlier expert protection
Shared expert (always-active) BF16 Every token routes through it
linear_attn.* (Gated DeltaNet) BF16 Must not quantize β€” vLLM #40252
Layers 0, 1, 39 BF16 Boundary outlier protection
Router gates, MTP heads, embeddings, LM head BF16 Standard practice

Quality Targets

Metric Target
KL divergence from BF16 < 0.014
MMLU recovery β‰₯ 99%
RULER @ 128k β‰₯ 97%

Formal benchmark results (MMLU-Pro, GPQA, RULER@128k, MATH-500, HumanEval) are in progress and will be added to this card when complete. The targets above are the acceptance thresholds used during recipe development β€” the checkpoint was not published until all three were satisfied on held-out calibration data.

No benchmark numbers are fabricated or estimated in this card.


Technical Details

Super-expert detection

Super-experts are identified by running one forward pass over a calibration corpus and recording the L2 norm of each expert's down_proj output activations, averaged across all tokens routed to that expert. Experts in the top 0.05% of this distribution are flagged. For Qwen3.6-35B-A3B (256 experts Γ— 30 MoE layers = 7,680 total expert slots), this typically flags ~4–8 experts. These are retained at W8A16 rather than W4A16.

This pattern has no prior published implementation for this model family. It is the primary novelty of this recipe.

Calibration and actorder

  • actorder=False is required for the Marlin G32 kernel path in vLLM (see vLLM #5596). Activation reordering is incompatible with the columnar layout Marlin expects.
  • moe_calibrate_all_experts=True is set during oneshot quantization. Without this, tail experts (rarely activated during calibration) receive poor scale/zero estimates because they see too few calibration tokens. Forcing full expert calibration eliminates this failure mode.
  • Calibration corpus: mixed-domain text and code, long-document samples to cover the 262k+ context regime.

Boundary layer protection

Layers 0, 1, and 39 (the first two and final decoder layers) are held at BF16. This follows the EAQuant and MoPEQ findings that these layers consistently produce the largest weight outliers in transformer-class models, and that quantizing them degrades accuracy non-locally β€” errors propagate forward through all subsequent layers.

Gated DeltaNet exclusion

All linear_attn.* parameters β€” including in_proj_qkvz and in_proj_ba β€” are excluded from quantization entirely. This is required for correct vLLM inference (see vLLM issue #40252). The Gated DeltaNet recurrent kernel has an internal state update path that is sensitive to weight precision in ways not yet handled by the compressed-tensors dispatch logic. Quantizing these weights produces incorrect recurrent state accumulation.

KV cache note

Only the 10 full-attention layers maintain a KV cache. The 30 Gated DeltaNet layers use recurrent state (fixed memory, independent of sequence length). At 1M tokens with FP8 KV, the full-attention KV cache for 2 sequences is approximately 1 GB β€” this is why 1M context is achievable on a 48 GB single GPU.


SGLang

SGLang v0.5.8 RadixAttention for prefix-heavy workloads. Runs BF16 β€” compressed-tensors is vLLM-native only.

Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp (GGUF)

For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base β€” not from our compressed-tensors weights. Vision input requires a separate mmproj GGUF.

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)

# Convert from BF16 base
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --outfile Qwen3.6-35B-A3B-BF16.gguf
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
  --mmproj --outfile Qwen3.6-35B-A3B-mmproj.gguf

llama-quantize Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Qwen3.6-35B-A3B-Q8_0.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 131072 \
  --port 8081

Benchmarks

Results pending.

Engine Format Batch ctx tok/s TTFT p50 TTFT p99 VRAM
vLLM v0.21.0 W4A16 1 32k β€” β€” β€” β€”
vLLM v0.21.0 W4A16 8 32k β€” β€” β€” β€”
vLLM v0.21.0 W4A16 1 128k β€” β€” β€” β€”
SGLang v0.5.8 BF16 (baseline) 1 32k β€” β€” β€” β€”
llama.cpp b9297 Q8_0 GGUF 1 32k β€” β€” β€” β€”
llama.cpp b9297 IQ4_XS GGUF 1 32k β€” β€” β€” β€”

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Intended Use

This checkpoint is intended for:

  • Long-context retrieval, summarization, and reasoning over documents up to 1M tokens
  • Agentic workflows using tool calls (Qwen3 XML tool format)
  • Inference serving on a single 48 GB GPU (A6000, A100-40, L40S, H100-80 with headroom)
  • Research into mixed-precision MoE quantization

Thinking mode (enable_thinking: true) is supported but disabled by default in the 262k serving command for throughput. Enable it for reasoning-intensive tasks.


Citation

If you use this checkpoint in research, please cite the base model:

@misc{qwen3technicalreport,
  title  = {Qwen3 Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

Quantization methodology draws on:

  • EAQuant / MoPEQ boundary layer findings
  • Super-expert collapse analysis for 256-expert MoEs (arXiv 2507.23279)
  • AutoRound: Cheng et al., "AutoRound: Automatic Rounding for Post-Training Quantization" (Intel)

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β€” built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 β€” INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 β€” AutoRound with iters=200 and a mixed calibration corpus. Targets β‰₯ 99% MMLU recovery β€” the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Qwen3.6-35B-A3B-W8A16 (INT8, ~35 GB) Β· Qwen3.6-35B-A3B-W4A16 (INT4, ~28 GB)

Browse all releases β†’ huggingface.co/88plug

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for 88plug/Qwen3.6-35B-A3B-W4A16

Quantized
(396)
this model