QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP

📖 中文版说明 — 中文模型卡

Model Description

This is the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of SC117/QwenPaw-Flash-9B-heretic.

QwenPaw-Flash-9B-heretic is based on Qwen3.5-9B with a Hybrid Attention architecture:

  • 24 Linear Attention layers (Gated DeltaNet)
  • 8 Full Attention layers (traditional Softmax Attention)
  • 1 MTP (Multi-Token Prediction) Head
  • 27 Vision Encoder layers (multimodal)

After quantization, the model size is reduced from ~38GB (FP32) to 13GB (AWQ INT4), making it runnable on consumer GPUs with 20GB+ VRAM.

Quantization Details

Parameter Value
Tool llmcompressor 0.12.1 + compressed-tensors 0.17.2
Format W4A16 (symmetric int4)
Group Size 128
AWQ Grid 20
Calibration wikitext-2-raw-v1 (128 samples)
Sequence Length 2048
Inference Precision bfloat16

Quantization Scope

Component Precision Notes
MLP (layers 1-31) — gate/up/down proj INT4 31 layers, ~4.68B params
Layer 0 (entire) BF16 First layer kept at full precision
Linear Attention (24 layers) BF16 Includes conv1d, in_proj_qkv, etc.
Full Attention (8 layers) BF16 Q/K/V/O projections
Vision Encoder (27 layers) FP32 Original precision preserved
MTP Head BF16 Speculative decoding preserved
Embed Tokens + LM Head BF16 Input/output embeddings

AWQ Smoothing

AWQ smoothing is applied only to MLP components:

  • post_attention_layernormmlp.gate_proj, mlp.up_proj

Inference Compatibility

Framework Status
SGLang ≥ 0.5.12 ✅ Tested and verified
vLLM ❌ Not yet tested
HuggingFace Transformers ✅ Supported

SGLang Launch Example

sglang serve \
  --trust-remote-code \
  --model-path /path/to/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP \
  --host 0.0.0.0 --port 8001 \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85

Python Load Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
    trust_remote_code=True,
)

Model Files

File Size Description
model.safetensors 10 GB Quantized text backbone (INT4 + BF16)
visual_mtp.safetensors 2.2 GB Vision encoder (FP32) + MTP head (BF16)
model.safetensors.index.json 76 KB Weight index

Memory Usage

Component Size
Model weights ~13 GB
KV Cache (fp8, 131K tokens) ~2 GB
Mamba Cache ~1 GB
Total ~16 GB

Recommended GPU: 20GB+ VRAM (RTX 3080 20GB / RTX 3090 / A100).

Disclaimer

This model is a quantized version of the source model, without additional training or fine-tuning. Please comply with the source model's license agreement.

Downloads last month
102
Safetensors
Model size
9B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP

Finetuned
Qwen/Qwen3.5-9B
Quantized
(6)
this model