Instructions for using redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP with libraries, inference servers, and local apps.

Libraries

How to use redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below contain an image, so the image-text-to-text task is used here;
# its pipeline accepts the `text=` keyword passed in the final call.
pipe = pipeline("image-text-to-text", model="redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP")
model = AutoModelForMultimodalLM.from_pretrained("redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Tokenize the chat messages and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens (skip the prompt)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```


vLLM

How to use redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```
docker model run hf.co/redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP
```

SGLang

How to use redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
# Start the SGLang server from the Docker image:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP with Docker Model Runner:
```
docker model run hf.co/redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP
```

QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP

📖 Chinese version: Chinese model card

Model Description

This is the AWQ (Activation-aware Weight Quantization) 4-bit quantized version of SC117/QwenPaw-Flash-9B-heretic.

QwenPaw-Flash-9B-heretic is based on Qwen3.5-9B with a Hybrid Attention architecture:

- 24 Linear Attention layers (Gated DeltaNet)
- 8 Full Attention layers (traditional Softmax Attention)
- 1 MTP (Multi-Token Prediction) Head
- 27 Vision Encoder layers (multimodal)

Quantization reduces the model size from ~38 GB (FP32) to ~13 GB (AWQ INT4), so the model can run on consumer GPUs with 20 GB or more of VRAM.

Quantization Details

| Parameter | Value |
|---|---|
| Tool | llmcompressor 0.12.1 + compressed-tensors 0.17.2 |
| Format | W4A16 (symmetric int4) |
| Group Size | 128 |
| AWQ Grid | 20 |
| Calibration | wikitext-2-raw-v1 (128 samples) |
| Sequence Length | 2048 |
| Inference Precision | bfloat16 |
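
To make the W4A16 and Group Size rows concrete, here is a minimal NumPy sketch of group-wise symmetric 4-bit weight quantization. It is an illustration only, not the llmcompressor implementation: the scale rule (`max|w| / 7` per group) and the helper names are assumptions made for this example.

```python
import numpy as np

GROUP_SIZE = 128  # "Group Size" from the table above


def quantize_w4_symmetric(weight, group_size=GROUP_SIZE):
    """Quantize a 2-D weight matrix to signed 4-bit codes with one scale per group.

    Assumption for this sketch: scale = max|w| / 7 inside each group, so every
    value maps into [-7, 7] and no zero-point is needed (symmetric scheme).
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)  # avoid division by zero
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales


def dequantize(codes, scales, shape):
    """Reconstruct approximate weights from 4-bit codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)  # toy matrix: 2 groups per row
codes, scales = quantize_w4_symmetric(w)
w_hat = dequantize(codes, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

Because each value is rounded to the nearest code, the reconstruction error per weight is at most half of its group's scale. Activations stay in 16-bit ("A16"), which is why inference precision is listed separately as bfloat16.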

Quantization Scope

| Component | Precision | Notes |
|---|---|---|
| MLP (layers 1-31): gate/up/down proj | INT4 | 31 layers, ~4.68B params |
| Layer 0 (entire) | BF16 | First layer kept at full precision |
| Linear Attention (24 layers) | BF16 | Includes conv1d, in_proj_qkv, etc. |
| Full Attention (8 layers) | BF16 | Q/K/V/O projections |
| Vision Encoder (27 layers) | FP32 | Original precision preserved |
| MTP Head | BF16 | Speculative decoding preserved |
| Embed Tokens + LM Head | BF16 | Input/output embeddings |
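
A rough storage estimate for the quantized MLP weights follows from the numbers above. The sketch assumes one 16-bit scale per group of 128 weights and no zero-point (consistent with a symmetric scheme); the actual on-disk layout may add packing or metadata overhead not modeled here.

```python
INT4_PARAMS = 4.68e9   # quantized MLP parameters, from the table above
GROUP_SIZE = 128       # from Quantization Details
SCALE_BITS = 16        # ASSUMPTION: one 16-bit scale per group, no zero-point

# Each weight costs 4 bits plus its share of the group's scale.
bits_per_weight = 4 + SCALE_BITS / GROUP_SIZE

int4_gb = INT4_PARAMS * bits_per_weight / 8 / 1e9   # size after quantization
bf16_gb = INT4_PARAMS * 16 / 8 / 1e9                # same weights kept in BF16

print(f"{bits_per_weight} bits/weight: INT4 ~{int4_gb:.2f} GB vs BF16 ~{bf16_gb:.2f} GB")
```

Under these assumptions the MLP weights shrink from about 9.36 GB to about 2.41 GB. The remaining components stay in BF16 or FP32, which is why the full checkpoint is much larger than the INT4 portion alone.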

AWQ Smoothing

AWQ smoothing is applied only to MLP components:

post_attention_layernorm → mlp.gate_proj, mlp.up_proj
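
The mapping above works because a per-channel scale can be divided out of the norm weight and multiplied into the input channels of the following projections without changing their outputs. The NumPy sketch below demonstrates only that equivalence on toy tensors; the scale vector is random here, whereas AWQ derives it from calibration data, and the array names merely stand in for the real modules.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16  # toy sizes, not the model's real dimensions

x_normed = rng.standard_normal((3, hidden))            # normalized activations
norm_weight = rng.standard_normal(hidden)              # stands in for post_attention_layernorm.weight
gate_w = rng.standard_normal((intermediate, hidden))   # stands in for mlp.gate_proj.weight
up_w = rng.standard_normal((intermediate, hidden))     # stands in for mlp.up_proj.weight

# Placeholder per-channel smoothing scales (AWQ would search for these).
scales = rng.uniform(0.5, 2.0, size=hidden)


def project(norm_w, gate, up):
    """Apply the elementwise norm weight, then both linear projections."""
    h = x_normed * norm_w
    return h @ gate.T, h @ up.T


gate_before, up_before = project(norm_weight, gate_w, up_w)
# Smooth: divide the norm weight by s, multiply each projection's input channels by s.
gate_after, up_after = project(norm_weight / scales, gate_w * scales, up_w * scales)
```

Both projections share the same scale vector because they read the same norm output. The scaled weights are what then get quantized, so the scales shift quantization difficulty between channels while leaving the full-precision function unchanged.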

Inference Compatibility

| Framework | Status |
|---|---|
| SGLang ≥ 0.5.12 | ✅ Tested and verified |
| vLLM | ❌ Not yet tested |
| HuggingFace Transformers | ✅ Supported |

SGLang Launch Example

```bash
sglang serve \
  --trust-remote-code \
  --model-path /path/to/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP \
  --host 0.0.0.0 --port 8001 \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85
```
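
Once the server is running, it can be queried over the same OpenAI-compatible `/v1/chat/completions` route used in the curl examples earlier. The sketch below uses only the Python standard library. `build_chat_request` and `chat` are hypothetical helper names, and the `model` value assumes the server registers the model under its `--model-path`; adjust it if your server reports a different name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # matches --port 8001 in the launch command
# ASSUMPTION: the served model name equals the --model-path value.
MODEL_NAME = "/path/to/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP"


def build_chat_request(prompt, model=MODEL_NAME, base_url=BASE_URL, max_tokens=64):
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` sends a single-turn request. Keeping request construction separate from sending makes the payload easy to inspect without a GPU.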

Python Load Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized text model; device_map="auto" places weights on available devices
model = AutoModelForCausalLM.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP",
    trust_remote_code=True,
)
```

Model Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 10 GB | Quantized text backbone (INT4 + BF16) |
| `visual_mtp.safetensors` | 2.2 GB | Vision encoder (FP32) + MTP head (BF16) |
| `model.safetensors.index.json` | 76 KB | Weight index |
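
The weight index can be used to check which tensors live in which file. A safetensors index stores a `weight_map` object mapping each tensor name to its file; the sketch below counts tensors per file. The tensor names in `example_index` are hypothetical placeholders, not names read from this repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_index(path):
    """Read a safetensors index file (e.g. model.safetensors.index.json)."""
    return json.loads(Path(path).read_text())


def tensors_per_file(index):
    """Count how many tensors the index assigns to each weight file."""
    return Counter(index["weight_map"].values())


# Hypothetical miniature index, for illustration only.
example_index = {
    "weight_map": {
        "model.layers.1.mlp.gate_proj.weight_packed": "model.safetensors",
        "model.layers.1.mlp.gate_proj.weight_scale": "model.safetensors",
        "visual.blocks.0.attn.qkv.weight": "visual_mtp.safetensors",
    }
}
counts = tensors_per_file(example_index)
```

On a downloaded copy, `tensors_per_file(load_index("model.safetensors.index.json"))` should report entries for both `model.safetensors` and `visual_mtp.safetensors` if the index covers both files.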

Memory Usage

| Component | Size |
|---|---|
| Model weights | ~13 GB |
| KV Cache (fp8, 131K tokens) | ~2 GB |
| Mamba Cache | ~1 GB |
| Total | ~16 GB |
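
The KV cache figure can be sanity-checked with simple arithmetic. Only the full-attention layers keep a per-token key/value cache; the linear-attention layers hold a fixed-size state, which is presumably what the Mamba Cache row covers. The KV head count and head dimension below are assumptions chosen for illustration, since this card does not state them; check the model's `config.json` for the real values.

```python
FULL_ATTENTION_LAYERS = 8   # from the architecture list in this card
TOKENS = 131_072            # "131K tokens"
BYTES_PER_VALUE = 1         # fp8_e4m3 stores one byte per value
NUM_KV_HEADS = 4            # ASSUMPTION: not stated in this card
HEAD_DIM = 256              # ASSUMPTION: not stated in this card

# Keys and values (factor 2) for every full-attention layer, per token.
bytes_per_token = 2 * FULL_ATTENTION_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_cache_gib = bytes_per_token * TOKENS / 2**30

# Sum of the rows in the table above, in GB.
total_gb = 13 + 2 + 1

print(f"KV cache: {kv_cache_gib:.1f} GiB, table total: ~{total_gb} GB")
```

With these assumed dimensions the estimate comes to 2 GiB, in line with the ~2 GB row. Using a 16-bit KV cache instead of fp8 would double that figure.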

Recommended GPU: 20 GB or more of VRAM (e.g., RTX 3080 20GB, RTX 3090, A100).

Disclaimer

This model is a quantized version of the source model; no additional training or fine-tuning was applied. Please comply with the source model's license agreement.


Safetensors: model size 9B params; tensor types I64, I32, BF16

Model tree for redashes/QwenPaw-Flash-9B-heretic-INT4-AWQ-MTP

Qwen/Qwen3.5-9B-Base (base model) → Qwen/Qwen3.5-9B (finetuned) → SC117/QwenPaw-Flash-9B-heretic (finetuned) → this model (quantized; 6 quantized variants listed)