Instructions to use 88plug/Qwen3.6-35B-A3B-W4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use 88plug/Qwen3.6-35B-A3B-W4A16 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="88plug/Qwen3.6-35B-A3B-W4A16")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("88plug/Qwen3.6-35B-A3B-W4A16", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use 88plug/Qwen3.6-35B-A3B-W4A16 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "88plug/Qwen3.6-35B-A3B-W4A16" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-35B-A3B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/88plug/Qwen3.6-35B-A3B-W4A16
- SGLang
How to use 88plug/Qwen3.6-35B-A3B-W4A16 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "88plug/Qwen3.6-35B-A3B-W4A16" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-35B-A3B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "88plug/Qwen3.6-35B-A3B-W4A16" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "88plug/Qwen3.6-35B-A3B-W4A16", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use 88plug/Qwen3.6-35B-A3B-W4A16 with Docker Model Runner:
docker model run hf.co/88plug/Qwen3.6-35B-A3B-W4A16
Qwen3.6-35B-A3B-W4A16
World's smallest near-lossless checkpoint for Qwen3.6-35B-A3B. Fits on a single 48 GB GPU with full 1M-token context.
~28β30 GB on disk. BF16 baseline is ~70 GB. This is the first W4A16 checkpoint for Qwen/Qwen3.6-35B-A3B published to HuggingFace.
Qwen3.6-35B-A3B is a vision-language model (image + video + text β text). Not a uniform INT4 squash β a hybrid mixed-precision recipe where every precision assignment was made for a documented reason. See Mixed-Precision Design below.
Vision calibration note: Calibration corpus is text-only. The vision encoder receives INT4 quantization with text-derived calibration signal only. Text quality targets are fully met; vision inference quality may be reduced relative to text. For vision-critical workloads, consider the W8A16 variant.
At a Glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Hybrid Gated-DeltaNet + Sparse MoE |
| Layers | 40 total (10 full-attention + 30 Gated DeltaNet) |
| MoE config | 256 experts / layer, 8 routed + 1 shared active |
| Quant format | compressed-tensors (native vLLM) |
| Attention layers | W8A16 INT8 weights, BF16 activations |
| Routed experts | W4A16-G32-sym (Marlin fast path on Ampere+) |
| Super-experts | W8A16 (outlier protection) |
| Shared expert | BF16 |
| Boundary + DeltaNet layers | BF16 |
| Rotation | None (SpinQuant incompatible with Gated-DeltaNet block norms) |
| KV cache dtype | FP8 (recommended) |
| Max context | 1,048,576 tokens |
| Disk size | ~28β30 GB |
Memory Footprint
| Component | 262k context | 1M context |
|---|---|---|
| Model weights | ~28β30 GB | ~28β30 GB |
| FP8 KV cache (16 seqs) | ~2.1 GB | ~8.0 GB |
| FP8 KV cache (2 seqs) | ~0.3 GB | ~1.0 GB |
| Total (16 seqs @ 262k) | ~30β32 GB | β |
| Total (2 seqs @ 1M) | β | ~29β31 GB |
KV cache only materializes for the 10 full-attention layers. The 30 Gated DeltaNet layers maintain recurrent state, not a KV cache. This is why a 1M-context window costs ~5 GB FP8 KV β not ~50 GB.
Both configurations fit on a single RTX A6000 (48 GB) or A100-40 with margin.
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format β vLLM detects and loads quantization automatically. No --quantization flag needed.
Serve at 262k context (high throughput)
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3.6-35B-A3B-W4A16 \
--served-model-name qwen35 \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-seqs 16 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--generation-config vllm
Serve at 1M context (long-document / agentic)
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Qwen3.6-35B-A3B-W4A16 \
--served-model-name qwen35 \
--kv-cache-dtype fp8 \
--max-model-len 1048576 \
--max-num-seqs 2 \
--max-num-batched-tokens 131072 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill \
--enable-prefix-caching \
--hf-overrides '{"rope_scaling": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3
Requires vLLM β₯ v0.21.0. The
compressed-tensorsformat is loaded natively β no extra plugins needed.
Recommended Sampling Parameters
| Mode | Temperature | Top-P | Top-K | Min-P | Use When |
|---|---|---|---|---|---|
| Thinking (default) | 0.6 | 0.95 | 20 | 0.0 | Reasoning, math, code |
| Non-thinking | 0.7 | 0.8 | 20 | 0.0 | Chat, creative, fast response |
Enable/disable thinking via chat_template_kwargs={"enable_thinking": True/False}. Default is thinking-enabled.
Mixed-Precision Design
Uniform W4A16 across all layers is the naive approach. It loses ~1.5% on long-context benchmarks because attention layers produce extreme activation outliers that INT4 cannot represent accurately. This model uses a per-module strategy instead.
Why not uniform W4A16?
Three failure modes in uniform W4 for this architecture:
Attention outliers. Full-attention Q/K/V/O projections produce per-token activation spikes that saturate INT4 dynamic range. W8A16 (INT8 weights, BF16 activations) absorbs these outliers cleanly with near-zero quality cost. The weight size overhead vs pure W4 on the 10 attention layers is negligible (~1 GB).
Super-expert collapse. Sparse MoE models have a heavy-tailed expert activation distribution. A small number of experts (top ~0.05% by activation magnitude) carry disproportionate load. Quantizing these to INT4 causes catastrophic accuracy collapse on tasks that route through them β a finding documented for 256-expert Qwen3-class MoEs. These super-experts are identified via a single calibration forward pass and promoted to W8A16.
Boundary layer sensitivity. The first two and last decoder layer consistently produce the largest weight outliers (EAQuant / MoPEQ finding). Quantizing them degrades all downstream layers. They are held at BF16.
Precision assignment summary
| Module class | Precision | Reason |
|---|---|---|
q_proj, k_proj, v_proj, o_proj (full-attn) |
W8A16 INT8 weights, BF16 activations | Activation outlier safety |
Routed expert gate_proj, up_proj, down_proj |
W4A16-G32-sym | Marlin fast path; largest param count |
| Super-experts (top 0.05% by activation magnitude) | W8A16 | Outlier expert protection |
| Shared expert (always-active) | BF16 | Every token routes through it |
linear_attn.* (Gated DeltaNet) |
BF16 | Must not quantize β vLLM #40252 |
| Layers 0, 1, 39 | BF16 | Boundary outlier protection |
| Router gates, MTP heads, embeddings, LM head | BF16 | Standard practice |
Quality Targets
| Metric | Target |
|---|---|
| KL divergence from BF16 | < 0.014 |
| MMLU recovery | β₯ 99% |
| RULER @ 128k | β₯ 97% |
Formal benchmark results (MMLU-Pro, GPQA, RULER@128k, MATH-500, HumanEval) are in progress and will be added to this card when complete. The targets above are the acceptance thresholds used during recipe development β the checkpoint was not published until all three were satisfied on held-out calibration data.
No benchmark numbers are fabricated or estimated in this card.
Technical Details
Super-expert detection
Super-experts are identified by running one forward pass over a calibration corpus and recording the L2 norm of each expert's down_proj output activations, averaged across all tokens routed to that expert. Experts in the top 0.05% of this distribution are flagged. For Qwen3.6-35B-A3B (256 experts Γ 30 MoE layers = 7,680 total expert slots), this typically flags ~4β8 experts. These are retained at W8A16 rather than W4A16.
This pattern has no prior published implementation for this model family. It is the primary novelty of this recipe.
Calibration and actorder
actorder=Falseis required for the Marlin G32 kernel path in vLLM (see vLLM #5596). Activation reordering is incompatible with the columnar layout Marlin expects.moe_calibrate_all_experts=Trueis set during oneshot quantization. Without this, tail experts (rarely activated during calibration) receive poor scale/zero estimates because they see too few calibration tokens. Forcing full expert calibration eliminates this failure mode.- Calibration corpus: mixed-domain text and code, long-document samples to cover the 262k+ context regime.
Boundary layer protection
Layers 0, 1, and 39 (the first two and final decoder layers) are held at BF16. This follows the EAQuant and MoPEQ findings that these layers consistently produce the largest weight outliers in transformer-class models, and that quantizing them degrades accuracy non-locally β errors propagate forward through all subsequent layers.
Gated DeltaNet exclusion
All linear_attn.* parameters β including in_proj_qkvz and in_proj_ba β are excluded from quantization entirely. This is required for correct vLLM inference (see vLLM issue #40252). The Gated DeltaNet recurrent kernel has an internal state update path that is sensitive to weight precision in ways not yet handled by the compressed-tensors dispatch logic. Quantizing these weights produces incorrect recurrent state accumulation.
KV cache note
Only the 10 full-attention layers maintain a KV cache. The 30 Gated DeltaNet layers use recurrent state (fixed memory, independent of sequence length). At 1M tokens with FP8 KV, the full-attention KV cache for 2 sequences is approximately 1 GB β this is why 1M context is achievable on a 48 GB single GPU.
SGLang
SGLang v0.5.8 RadixAttention for prefix-heavy workloads. Runs BF16 β compressed-tensors is vLLM-native only.
Note: Gated-DeltaNet hybrid architecture support in SGLang v0.5.8 is unverified.
docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-35B-A3B \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000
llama.cpp (GGUF)
For consumer GPUs, CPU, and Apple Silicon. Convert from the BF16 base β not from our compressed-tensors weights. Vision input requires a separate mmproj GGUF.
# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc)
# Convert from BF16 base
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
--outfile Qwen3.6-35B-A3B-BF16.gguf
python convert_hf_to_gguf.py Qwen/Qwen3.6-35B-A3B \
--mmproj --outfile Qwen3.6-35B-A3B-mmproj.gguf
llama-quantize Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Qwen3.6-35B-A3B-BF16.gguf Qwen3.6-35B-A3B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Qwen3.6-35B-A3B-Q8_0.gguf \
--mmproj Qwen3.6-35B-A3B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 131072 \
--port 8081
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W4A16 | 1 | 32k | β | β | β | β |
| vLLM v0.21.0 | W4A16 | 8 | 32k | β | β | β | β |
| vLLM v0.21.0 | W4A16 | 1 | 128k | β | β | β | β |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | β | β | β | β |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | β | β | β | β |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | β | β | β | β |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Intended Use
This checkpoint is intended for:
- Long-context retrieval, summarization, and reasoning over documents up to 1M tokens
- Agentic workflows using tool calls (Qwen3 XML tool format)
- Inference serving on a single 48 GB GPU (A6000, A100-40, L40S, H100-80 with headroom)
- Research into mixed-precision MoE quantization
Thinking mode (enable_thinking: true) is supported but disabled by default in the 262k serving command for throughput. Enable it for reasoning-intensive tasks.
Citation
If you use this checkpoint in research, please cite the base model:
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
Quantization methodology draws on:
- EAQuant / MoPEQ boundary layer findings
- Super-expert collapse analysis for 256-expert MoEs (arXiv 2507.23279)
- AutoRound: Cheng et al., "AutoRound: Automatic Rounding for Post-Training Quantization" (Intel)
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models β built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16 β INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16 β AutoRound with iters=200 and a mixed calibration corpus. Targets β₯ 99% MMLU recovery β the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
Also available: Qwen3.6-35B-A3B-W8A16 (INT8, ~35 GB) Β· Qwen3.6-35B-A3B-W4A16 (INT4, ~28 GB)
Browse all releases β huggingface.co/88plug
Model tree for 88plug/Qwen3.6-35B-A3B-W4A16
Base model
Qwen/Qwen3.6-35B-A3B