Qwen3.5-27B-RotorQuant-2bit
2-bit KV cache compression for Qwen/Qwen3.5-27B using RotorQuant.
This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.
Overview
Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.
RotorQuant applies rotation-based isotropic quantization to the KV cache, achieving better quality and speed than standard quantization approaches at the same bit width.
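To illustrate why rotating before quantizing can help, here is a minimal pure-Python sketch. It is not RotorQuant's actual algorithm: it assumes a generic randomized Hadamard rotation and plain min/max uniform 2-bit quantization, and every function name in it is hypothetical. The toy vector contains one outlier channel, which is the case where spreading energy evenly across dimensions should reduce quantization error.

```python
import math
import random


def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def rotate(vec, signs):
    """Random sign flips followed by a Hadamard transform (an orthogonal rotation)."""
    return hadamard([x * s for x, s in zip(vec, signs)])


def unrotate(vec, signs):
    """Inverse of rotate(): the normalized Hadamard transform is its own inverse."""
    return [x * s for x, s in zip(hadamard(vec), signs)]


def quantize(vec, bits=2):
    """Uniform min/max quantization; returns integer codes plus offset and step."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    return [round((x - lo) / step) for x in vec], lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 64  # toy head dimension (assumption, not taken from the model config)
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
key[3] = 25.0  # a single outlier channel stretches the quantization range

# Quantize directly vs. rotate, quantize, dequantize, rotate back.
plain = dequantize(*quantize(key))
rotated = unrotate(dequantize(*quantize(rotate(key, signs))), signs)

err_plain = mse(key, plain)
err_rotated = mse(key, rotated)
print(f"2-bit MSE without rotation: {err_plain:.3f}")
print(f"2-bit MSE with rotation:    {err_rotated:.3f}")
```

Because the rotation is orthogonal, error measured after rotating back equals the error introduced in the rotated space, so the comparison is fair. Real KV cache quantizers typically work per head or per group and store scales compactly; this sketch omits all of that.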
RotorQuant Advantages
| Metric | RotorQuant 2-bit | Standard 2-bit |
|---|---|---|
| Prefill speed | 5.3x faster | Baseline |
| Decode speed | 28% faster | Baseline |
| Perplexity | 6.91 | 7.07 |
RotorQuant achieves lower perplexity (better quality) while also being faster, a rare combination at aggressive quantization levels.
Specifications
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |
Memory Estimates
| Component | Estimate |
|---|---|
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
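The estimates above can be combined into a rough memory budget. The short sketch below only does arithmetic on the figures in the table; the variable names are illustrative, and it ignores activations, framework overhead, and any non-KV state the hybrid architecture may keep.

```python
# Figures copied from the Memory Estimates table (GB, 128K context).
weights_gb = 54.0
kv_fp16_gb = 12.8
kv_rotorquant_gb = 1.3

# Ratio of the two cache estimates and the resulting totals.
compression = kv_fp16_gb / kv_rotorquant_gb
total_fp16_gb = weights_gb + kv_fp16_gb
total_rotorquant_gb = weights_gb + kv_rotorquant_gb
saved_gb = kv_fp16_gb - kv_rotorquant_gb

print(f"KV cache compression: {compression:.1f}x")
print(f"Weights + FP16 cache:       {total_fp16_gb:.1f} GB")
print(f"Weights + RotorQuant cache: {total_rotorquant_gb:.1f} GB")
print(f"Cache memory saved:         {saved_gb:.1f} GB")
```

At this context length the BF16 weights still dominate the total, so the cache savings matter most when the weights already fit and the remaining headroom is tight.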
Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
# add_generation_prompt=True appends the assistant turn header so the model starts its reply
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quality Notes
- 2-bit is aggressive quantization, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
- Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
- For higher quality with moderate compression, consider 4-bit KV cache variants.
- Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.
References
- RotorQuant: rotation-based isotropic KV cache quantization
- Qwen3.5-27B base model
See Also
- majentik/Qwen3.5-27B-TurboQuant-2bit: TurboQuant 2-bit KV cache variant
- majentik/Qwen3.5-27B-TurboQuant-MLX-2bit: MLX 2-bit weights + TurboQuant KV cache
- majentik/Qwen3.5-27B-RotorQuant-MLX-2bit: MLX 2-bit weights + RotorQuant KV cache
Variants in this family
(Showing 16 sibling variants under majentik/qwen3.5-27b-*. The current variant, RotorQuant-2bit, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **RotorQuant-2bit** | transformers | n/a | 2-bit KV cache config (no weight files) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~23 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~16 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~21 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~30 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~36 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~57 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-2bit | transformers | n/a | Standalone 2-bit weights |
| TurboQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |