Qwen3.6-27B — Native TQ3 Checkpoint (15.37 GB)

Native 3-bit TurboQuant checkpoint of Qwen/Qwen3.6-27B. Dense weights stored as packed 3-bit indices with per-group norms. 15.37 GB on disk instead of ~54 GB BF16 (3.6× compression).

This is the dense 27B sibling of Qwen3.6-35B-A3B — same Qwen3.6 family, same plugin code path, no MoE: every layer is active on every token. 64 hidden layers in a hybrid attention pattern: 16 full-attention layers + 48 GatedDeltaNet linear-attention layers, plus a 1-layer MTP head. partial_rotary_factor=0.25 (the v0.13.0 block-diagonal WHT CUDA path handles the partial-rotary projections).

Usage

Requires turboquant-plus-vllm v0.13.5 or later and vLLM 0.20.2+.

pip install vllm>=0.20.2
pip install 'turboquant-plus-vllm@git+https://github.com/varjoranta/turboquant-vllm.git'

vllm serve varjosoft/Qwen3.6-27B-TQ3-native \
    --quantization turboquant \
    --trust-remote-code \
    --max-model-len 4096

Python:

from vllm import LLM, SamplingParams

llm = LLM(
    model="varjosoft/Qwen3.6-27B-TQ3-native",
    quantization="turboquant",
    trust_remote_code=True,
    max_model_len=4096,
)
out = llm.generate(
    ["Explain quantum entanglement in one paragraph."],
    SamplingParams(temperature=0, max_tokens=200),
)
print(out[0].outputs[0].text)

Results

Validated end-to-end (save_tq3 → load → generate → eager bench → graphs-on bench) on two hardware classes:

Hardware	vLLM	Mode	Throughput at bs=1
RTX PRO 6000 Blackwell 96GB (sm_120)	0.21.0	CUDA graphs ON	14.46 tok/s
RTX PRO 6000 Blackwell 96GB (sm_120)	0.21.0	eager	13.93 tok/s
A100 80GB (sm_80)	0.20.2	eager	5.4 tok/s

Cross-arch validated: same checkpoint runs on Blackwell sm_120 and Ampere sm_80 with no code change. Blackwell gives ~2.6× over Ampere for this dense+hybrid model.

Graphs-vs-eager delta is only ~4% on this model (vs ~1.6× on the MoE 35B-A3B sister). DeltaNet linear-attention amortizes fewer per-step kernel launches than MoE expert dispatch, so CUDA graphs help less in absolute terms.

GSM8K eval is pending — the eval harness exceeds the validation supervisor's polling window. Will be added as a follow-up when run standalone.

What this checkpoint contains

model-0000{1..3}-of-00003.safetensors — packed 3-bit weight indices (.tq_packed) and per-group norms (.tq_norms) for all Linear layers across the 64 hybrid blocks + 1 MTP module. FP16 retained for embeddings, RMSNorms, biases, and GatedDeltaNet conv1d/A_log/dt_bias tensors (skip-patterned in the plugin to preserve sequence-state precision).
tq_config.json — {"bits": 3, "group_size": 128, "format": "tq3_native"}.
config.json, chat_template.jinja, tokenizer.json, tokenizer_config.json, preprocessor_config.json, generation_config.json — standard HuggingFace artifacts copied verbatim from the source.

How it was made

from turboquant_vllm.checkpoint import save_tq3_checkpoint

save_tq3_checkpoint("Qwen/Qwen3.6-27B", "./qwen3.6-27b-tq3", bits=3)
# CPU only, ~60 GB RAM during compression. No GPU needed.

Each weight tensor is read lazily from the source safetensors, rotated with a Walsh-Hadamard transform, quantized to 3 bits against a Gaussian Lloyd-Max codebook, and saved with per-group norms.

Non-weight tensors (embeddings, norms, biases, GatedDeltaNet state tensors) stay in FP16. The plugin's _SKIP_PATTERNS set includes conv1d so the DeltaNet recurrent-state convolution kernels keep their numerical precision.

Architecture specifics

Qwen3.6-27B is the dense member of the Qwen3.6 family released alongside the MoE 35B-A3B variant. Three patterns the loader handles:

Hybrid attention layout — 16 full-attention layers interleaved with 48 GatedDeltaNet linear-attention layers (64 total). All weights compress through the same per-Linear TQ3 path; the conv1d state tensors inside DeltaNet are skip-patterned (memory plugin_gdn_delta_attn_gap, resolved in PR #31).
Partial-rotary attention (partial_rotary_factor=0.25) — uses the block-diagonal WHT CUDA kernel from v0.13.0 to dequant the rotary projections without falling back to Python.
MTP (Multi-Token Prediction) head — 1-layer MTP module; its Linears compress through the same path as the main stack.

GQA: 24 query heads, 4 KV heads. hidden_size 5120.

Algorithm

Inspired by TurboQuant (Zandieh, Daliri, Hadian, Mirrokni; ICLR 2026). Our implementation uses a Gaussian Lloyd-Max codebook as an approximation of the paper's distortion-rate framework. Norm correction stores original_norm / reconstruction_norm per group to fix magnitude shrinkage at 3-bit.

The weight scheme matches the scalar case of HIGGS (Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh; NAACL 2025); the reference implementation is in HuggingFace Transformers as HiggsConfig. This package's role is the production vLLM integration plus the supporting MoE / hybrid-architecture infrastructure.

Citation

@inproceedings{malinovskii2025higgs,
  title={Pushing the Limits of {LLM} Quantization via the Linearity Theorem},
  author={Malinovskii, Vladimir and Panferov, Andrei and Ilin, Ivan and Guo, Han and Richtárik, Peter and Alistarh, Dan},
  booktitle={Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2025}
}

@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

Compressed by Varjosoft Oy using turboquant-plus-vllm v0.13.5.