Qwen3.5-27B-RotorQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using RotorQuant.

This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime to the original Qwen3.5-27B weights.

Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

RotorQuant applies rotation-based isotropic quantization to the KV cache, achieving better quality and speed than standard quantization approaches at the same bit width.
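The general idea of rotating before quantizing can be sketched in a few lines of NumPy. This is an illustrative sketch only, not RotorQuant's actual algorithm: the random orthogonal rotation, the per-vector min-max 2-bit grid, and every function name below are assumptions made for the example. The rotation is meant to spread energy more evenly across dimensions before the values are snapped to four levels; whether that lowers error in practice depends on the data and on the quantizer.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the diagonal of R is positive (makes the factorization unique).
    return q * np.sign(np.diag(r))

def quantize_2bit(x):
    """Uniform min-max quantization of each row to 4 levels (codes 0..3)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)  # avoid division by zero
    codes = np.clip(np.rint((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo):
    """Map 2-bit codes back to approximate real values."""
    return codes.astype(np.float32) * scale + lo

def roundtrip(x, rotation):
    """Rotate, quantize to 2 bits, dequantize, and rotate back."""
    codes, scale, lo = quantize_2bit(x @ rotation)
    return dequantize_2bit(codes, scale, lo) @ rotation.T

dim = 16
rotation = random_rotation(dim)
x = np.random.default_rng(1).standard_normal((4, dim))  # stand-in for cached key/value vectors
x_hat = roundtrip(x, rotation)
print("mean squared reconstruction error:", float(np.mean((x - x_hat) ** 2)))
```

Because the rotation is orthogonal, it preserves vector norms, so the reconstruction error measured after rotating back equals the error introduced in the rotated space.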

RotorQuant Advantages

| Metric | RotorQuant 2-bit | Standard 2-bit |
| --- | --- | --- |
| Prefill speed | 5.3x faster | Baseline |
| Decode speed | 28% faster | Baseline |
| Perplexity | 6.91 | 7.07 |

RotorQuant achieves lower perplexity (better quality) while also being faster, a rare combination at aggressive quantization levels.
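To put the reported figures on a common scale, the short calculation below converts them into a relative perplexity reduction and into the fraction of baseline time each phase would take. It assumes that "5.3x faster" and "28% faster" refer to throughput ratios of 5.3 and 1.28; the source does not say how the speedups were measured, and the variable names are illustrative.

```python
# Figures reported in the comparison above.
ppl_rotor, ppl_standard = 6.91, 7.07
prefill_speedup = 5.3   # assumed to mean 5.3x the baseline throughput
decode_speedup = 1.28   # assumed to mean 1.28x the baseline throughput

# Relative perplexity reduction versus standard 2-bit quantization.
ppl_reduction_pct = (ppl_standard - ppl_rotor) / ppl_standard * 100

# Fraction of the baseline wall-clock time under the throughput assumption.
prefill_time_fraction = 1 / prefill_speedup
decode_time_fraction = 1 / decode_speedup

print(f"perplexity reduction: {ppl_reduction_pct:.2f}%")
print(f"prefill time vs baseline: {prefill_time_fraction:.1%}")
print(f"decode time vs baseline: {decode_time_fraction:.1%}")
```

Under these assumptions the perplexity gain is a little over 2%, so most of the practical benefit over standard 2-bit quantization comes from speed.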

Specifications

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |

Memory Estimates

| Component | Estimate |
| --- | --- |
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
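The estimates above can be combined into a total footprint and a compression ratio. The sketch below uses only the numbers listed in this section; the variable names are illustrative, and activations and framework overhead are not included.

```python
# Estimates from the table above, in GB.
weights_gb = 54.0   # model weights in BF16
kv_fp16_gb = 12.8   # KV cache at 128K context, FP16 baseline
kv_2bit_gb = 1.3    # KV cache at 128K context, 2-bit RotorQuant

compression_ratio = kv_fp16_gb / kv_2bit_gb
total_fp16_gb = weights_gb + kv_fp16_gb
total_2bit_gb = weights_gb + kv_2bit_gb
saved_gb = kv_fp16_gb - kv_2bit_gb

print(f"KV cache compression: {compression_ratio:.1f}x")
print(f"weights + FP16 cache: {total_fp16_gb:.1f} GB")
print(f"weights + 2-bit cache: {total_2bit_gb:.1f} GB")
print(f"memory saved at 128K context: {saved_gb:.1f} GB")
```

Since the weights stay in BF16, they dominate the total; the saving from cache compression matters most when the cache would otherwise push a deployment past a GPU memory limit.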

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
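# add_generation_prompt=True appends the assistant turn marker so the model
# answers the user message; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)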
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quality Notes

  • 2-bit is aggressive quantization, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
  • Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
  • For higher quality with moderate compression, consider 4-bit KV cache variants.
  • Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.

See Also

Variants in this family

(Showing 16 sibling variants under majentik/qwen3.5-27b-*. The current variant, RotorQuant-2bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
| --- | --- | --- | --- |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-2bit** | transformers | n/a | 2-bit KV cache configuration (no weight files) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~23 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~16 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~21 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~30 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~36 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~57 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-2bit | transformers | n/a | Standalone 2-bit weights |
| TurboQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |