Instructions for using kumar2235/Qwen3.5-4B-AWQ with libraries and local apps.

Libraries

How to use kumar2235/Qwen3.5-4B-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="kumar2235/Qwen3.5-4B-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("kumar2235/Qwen3.5-4B-AWQ")
model = AutoModelForMultimodalLM.from_pretrained("kumar2235/Qwen3.5-4B-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use kumar2235/Qwen3.5-4B-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "kumar2235/Qwen3.5-4B-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kumar2235/Qwen3.5-4B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/kumar2235/Qwen3.5-4B-AWQ

SGLang

How to use kumar2235/Qwen3.5-4B-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "kumar2235/Qwen3.5-4B-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "kumar2235/Qwen3.5-4B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "kumar2235/Qwen3.5-4B-AWQ" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use kumar2235/Qwen3.5-4B-AWQ with Docker Model Runner:
```
docker model run hf.co/kumar2235/Qwen3.5-4B-AWQ
```

Qwen3.5-4B-AWQ

AWQ INT4 quantization of Qwen/Qwen3.5-4B using llm-compressor and 512 OpenPlatypus calibration samples.

About 2.56× smaller on disk and ~61.7% lower VRAM at load than the BF16 baseline. Standalone benchmark scores for the quantized model are listed under Benchmarks.

Model compression

Metric	BF16 baseline	AWQ INT4 (this model)
Model size	~8.0 GB	~3.13 GB (2.56× smaller)
VRAM at load	~8.0 GB	~3.06 GB (2.61× smaller)
Bits / weight	16	4 (4× fewer)
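
The ratios in the table follow directly from the reported sizes. The short sketch below recomputes them; the variable and function names are illustrative, and the only inputs are the GB figures quoted above.

```python
# Figures copied from the compression table (GB).
BF16_DISK_GB, AWQ_DISK_GB = 8.0, 3.13
BF16_VRAM_GB, AWQ_VRAM_GB = 8.0, 3.06


def ratio(baseline: float, quantized: float) -> float:
    """How many times smaller the quantized figure is."""
    return baseline / quantized


def percent_saved(baseline: float, quantized: float) -> float:
    """Relative reduction, as a percentage of the baseline."""
    return 100.0 * (baseline - quantized) / baseline


disk_ratio = ratio(BF16_DISK_GB, AWQ_DISK_GB)
vram_ratio = ratio(BF16_VRAM_GB, AWQ_VRAM_GB)
vram_saved = percent_saved(BF16_VRAM_GB, AWQ_VRAM_GB)

print(f"disk: {disk_ratio:.2f}x smaller")   # disk: 2.56x smaller
print(f"VRAM: {vram_ratio:.2f}x smaller")   # VRAM: 2.61x smaller
print(f"VRAM saved: {vram_saved:.2f}%")     # VRAM saved: 61.75%
```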

Benchmarks

Note: these are the quantized model's standalone scores from EleutherAI lm-evaluation-harness, default settings, 0-shot. HellaSwag and ARC-Challenge use acc_norm; PIQA, Winogrande, and ARC-Easy use acc, matching each task's harness default. A matched BF16-vs-INT4 delta on identical hardware and settings has not yet been run for this model; treat the scores below as standalone results rather than a verified quantization delta.

Benchmark	Metric	Score
PIQA	`acc`	77.69
Winogrande	`acc`	68.75
HellaSwag	`acc_norm`	71.65
ARC-Easy	`acc`	73.53
ARC-Challenge	`acc_norm`	51.71

Average score: 68.67%
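
The average is the unweighted mean of the five scores in the table, so it mixes `acc` and `acc_norm` metrics. A minimal check, using only the numbers reported above:

```python
# Scores copied from the benchmark table (percent).
scores = {
    "PIQA": 77.69,           # acc
    "Winogrande": 68.75,     # acc
    "HellaSwag": 71.65,      # acc_norm
    "ARC-Easy": 73.53,       # acc
    "ARC-Challenge": 51.71,  # acc_norm
}

# Unweighted mean across tasks.
average = sum(scores.values()) / len(scores)
print(f"Average score: {average:.2f}%")  # Average score: 68.67%
```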

Quantization recipe

Setting	Value
Method	AWQ
Scheme	W4A16_ASYM
Group size	128
Zero point	True
Calibration dataset	OpenPlatypus, 512 samples
Max sequence length	1024
Tool	llm-compressor
Format	compressed-tensors

Calibration used real instruction-following data from OpenPlatypus rather than data-free quantization techniques.
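
To make the recipe settings concrete, here is a rough pure-Python sketch of what "4-bit, asymmetric, group size 128, with zero point" means for a single row of weights. It uses plain round-to-nearest quantization. It is not AWQ's activation-aware scaling and not llm-compressor's implementation, and all function names are illustrative.

```python
import random

GROUP_SIZE = 128        # "Group size" in the recipe table
N_BITS = 4              # W4: 4-bit weights
QMAX = 2**N_BITS - 1    # asymmetric scheme: unsigned codes 0..15


def quantize_group(weights):
    """Asymmetric round-to-nearest quantization of one group.

    Returns (codes, scale, zero_point). The range is widened to include
    zero so the zero point always lands inside 0..QMAX.
    """
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / QMAX or 1.0  # all-zero group: avoid scale 0
    zero_point = round(-w_min / scale)
    codes = [max(0, min(QMAX, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row):
    """Quantize and dequantize a row group by group, each with its own scale."""
    restored = []
    for start in range(0, len(row), GROUP_SIZE):
        group = row[start:start + GROUP_SIZE]
        restored.extend(dequantize_group(*quantize_group(group)))
    return restored


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]  # synthetic weights
restored = quantize_row(row)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max round-trip error: {max_err:.5f}")
```

Each group stores its own scale and zero point next to the 4-bit codes, which is why the checkpoint holds several tensor types rather than a single INT4 tensor.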

Usage

With transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "kumar2235/Qwen3.5-4B-AWQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "Explain machine learning in one paragraph."

inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="kumar2235/Qwen3.5-4B-AWQ"
)

outputs = llm.generate(
    ["Explain machine learning in one paragraph."],
    SamplingParams(
        temperature=0.7,
        max_tokens=256
    )
)

print(outputs[0].outputs[0].text)
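
If the model is served with `vllm serve` rather than loaded in-process, the OpenAI-compatible chat endpoint can also be called from Python. The sketch below uses only the standard library and assumes a server at `http://localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

MODEL_ID = "kumar2235/Qwen3.5-4B-AWQ"
# Assumption: a local server started with `vllm serve` on port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("Explain machine learning in one paragraph."))` sends the same prompt as the offline example above.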

Sample output

Prompt:

Explain machine learning in one paragraph.

Response:

Machine learning is a branch of artificial intelligence that enables computers to learn patterns from data and improve their performance on tasks without being explicitly programmed for every situation. By analyzing large amounts of information, machine learning models can make predictions, classify data, recognize patterns, and support decision-making. It powers applications such as recommendation systems, image recognition, language translation, fraud detection, and autonomous systems.

Hardware

Component	Specification
GPU (calibration)	NVIDIA RTX 6000 Ada
GPU Memory	48 GB
CUDA	13.2
Quantization tool	llm-compressor
Quantization method	AWQ W4A16_ASYM

Weights: ~3.13 GB on disk, ~3.06 GB VRAM at load
Single-GPU friendly: comfortably fits on 8 GB+ consumer cards for local inference and edge deployment
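
As a rough way to read the "8 GB+" guidance, the sketch below subtracts the reported load-time VRAM from a card's capacity. The `reserved_gb` allowance for the CUDA context and framework overhead is a hypothetical placeholder, not a measured value; whatever remains is shared by the KV cache and activations.

```python
WEIGHTS_VRAM_GB = 3.06  # "VRAM at load" reported in this card


def headroom_gb(card_gb, reserved_gb=1.0):
    """Memory left after loading weights, minus an assumed fixed overhead.

    reserved_gb is a hypothetical allowance, not a measurement.
    """
    return card_gb - WEIGHTS_VRAM_GB - reserved_gb


for card in (8, 12, 24):
    print(f"{card} GB card: ~{headroom_gb(card):.2f} GB left for KV cache and activations")
```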

Limitations

Benchmarks above are standalone scores for the quantized model. They have not yet been compared against a BF16 run under identical harness settings, so the accuracy delta from quantization is unconfirmed.
Calibration used OpenPlatypus (English-leaning instruction data), so heavily non-English or domain-specific workloads may benefit from re-quantizing on a matching corpus.
Max sequence length during calibration was 1024 tokens; behavior at much longer contexts has not been separately validated.
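
The missing BF16-vs-INT4 comparison could be produced along the following lines. This is a sketch, not the procedure used for this card: `run_harness` assumes lm-evaluation-harness is installed and calls its `simple_evaluate` entry point, while `score_delta` and the constant names are illustrative. No BF16 scores are included because none have been measured.

```python
# Harness task names for the five benchmarks reported in this card.
TASKS = ["piqa", "winogrande", "hellaswag", "arc_easy", "arc_challenge"]

# Quantized-model scores copied from the Benchmarks table (percent).
AWQ_SCORES = {
    "piqa": 77.69,
    "winogrande": 68.75,
    "hellaswag": 71.65,
    "arc_easy": 73.53,
    "arc_challenge": 51.71,
}


def run_harness(model_id):
    """Run the 0-shot tasks on one model and return the raw harness results.

    Assumption: lm-evaluation-harness is installed and the model loads
    through its "hf" backend. Imported lazily so the rest of the sketch
    works without it.
    """
    import lm_eval

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        num_fewshot=0,
    )


def score_delta(baseline, quantized):
    """Per-task difference in points (quantized minus baseline)."""
    shared = sorted(set(baseline) & set(quantized))
    return {task: round(quantized[task] - baseline[task], 2) for task in shared}
```

Once a BF16 run of Qwen/Qwen3.5-4B exists under the same settings, its per-task scores can be passed as `baseline` to `score_delta(baseline, AWQ_SCORES)`.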

License

Inherits the license of the base model. See the Qwen/Qwen3.5-4B model page for terms.

Citation

Base model

@misc{qwen3.5-4b,
    title  = {{Qwen3.5-4B}},
    author = {{Qwen Team}},
    year   = {2025},
    url    = {https://huggingface.co/Qwen/Qwen3.5-4B}
}

Quantization method

@article{lin2023awq,
    title   = {{AWQ}: Activation-aware Weight Quantization for LLM Compression and Acceleration},
    author  = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
    journal = {arXiv preprint arXiv:2306.00978},
    year    = {2023}
}

Storage Format

This model uses the compressed-tensors format.

Hugging Face may display BF16/I32/I64 tensor types because compressed AWQ models store quantization metadata, scales, and packed weights separately. The model loads and runs as a compressed AWQ INT4 model through Transformers and llm-compressor.
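
The tensor types can be checked locally without loading the model. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing each tensor. The sketch below reads only that header; the function name and the file path in the usage note are assumptions.

```python
import json
import struct
from collections import Counter


def dtype_counts(path):
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return Counter(entry["dtype"] for entry in header.values())
```

For example, `dtype_counts("model.safetensors")` on a downloaded shard (hypothetical file name) would list how many tensors are stored as each type, such as I32, I64, or BF16.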
