Qwen3-8B W4A16-AWQ (int4)

A 4-bit (W4A16-AWQ) quantization of Qwen/Qwen3-8B, produced for the BareMetalRT consumer inference engine.

This is a modified, quantized derivative of Qwen3-8B. The weights have been AWQ 4-bit quantized; everything else (architecture, tokenizer, chat template) is unchanged from the base model.

What this is for

It delivers 8B-class quality on an 8 GB consumer GPU (the bf16 Qwen3-8B needs ~16 GB). It loads directly on the TensorRT-LLM PyTorch (_torch) backend, which reads hf_quant_config.json and dispatches the W4A16-AWQ linear kernels, so no engine build step is required.
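As a rough sanity check on those memory figures, weight storage can be estimated from the parameter count and the bits stored per weight. The sketch below assumes about 8.2 billion parameters and one 16-bit scale per group of 128 weights; both are assumptions for illustration. It ignores the KV cache, activations, the higher-precision lm_head, and runtime overhead, so the results are ballpark only.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8.2e9   # assumption: approximate Qwen3-8B parameter count
GROUP_SIZE = 128   # group size stated in the Quantization section

# bf16 stores 16 bits per weight.
bf16_gb = weight_gb(N_PARAMS, 16)

# W4A16 stores 4 bits per weight plus (assumed) one 16-bit scale per group.
int4_bits_per_weight = 4 + 16 / GROUP_SIZE
int4_gb = weight_gb(N_PARAMS, int4_bits_per_weight)

print(f"bf16 weights : {bf16_gb:.1f} GB")
print(f"W4A16 weights: {int4_gb:.1f} GB")
```

The gap between the two estimates is what leaves room for the KV cache and activations on an 8 GB card.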

Quantization

Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with per-input-channel pre-quant scales. The lm_head is kept in higher precision.
Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37) with the INT4_AWQ_CFG config. The AWQ scale search was calibrated on cnn_dailymail, and the checkpoint was exported via export_hf_checkpoint.
Format: modelopt unified-HF checkpoint (hf_quant_config.json + quant_algo: W4A16_AWQ).
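To illustrate what "group size 128" means in practice, here is a minimal pure-Python sketch of group-wise 4-bit weight quantization. It assumes a symmetric int4 scheme with one scale per group of 128 weights. It does not model the AWQ scale search or the pre-quant scales, and it is not the modelopt implementation; the function names are hypothetical.

```python
import random

GROUP_SIZE = 128  # matches the group size stated above

def quantize_groupwise(row, group_size=GROUP_SIZE):
    """Symmetric int4 quantization of one weight row, one scale per group."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Map the largest magnitude in the group to +/-7 (assumed convention).
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the int4 range [-8, 7].
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=GROUP_SIZE):
    """Recover approximate weights by multiplying each value by its group scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weight row
q, scales = quantize_groupwise(row)
restored = dequantize_groupwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), "scales; max abs error:", max_err)
```

Each group carries its own scale, so an outlier in one group does not degrade the resolution of the others. The rounding error per weight is bounded by half of that group's scale.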

Hardware notes

Ada (RTX 40-series, sm_89) and newer: runs with the kernel autotuner on.
Ampere (RTX 30-series, sm_86): runs with the autotuner disabled (the fpA_intB tactic search does not converge on Ampere). BareMetalRT handles this automatically per-GPU.
Validated end-to-end (coherence battery) on an RTX 4070 SUPER.
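The per-GPU rule above amounts to a compute-capability check. The helper below is a hypothetical illustration, not BareMetalRT's actual code: the function name is invented, and treating every architecture older than sm_89 like Ampere is an assumption.

```python
def autotuner_enabled(major: int, minor: int) -> bool:
    """Return True when the kernel autotuner should be on for this GPU.

    Hypothetical rule: Ada (sm_89) and newer -> on; anything older -> off.
    """
    return (major, minor) >= (8, 9)

print(autotuner_enabled(8, 9))  # RTX 40-series (Ada)
print(autotuner_enabled(8, 6))  # RTX 30-series (Ampere)
```

With PyTorch installed, the (major, minor) pair for the current device can be obtained from torch.cuda.get_device_capability().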

License

Inherits Apache 2.0 from the base model. See LICENSE and NOTICE. You must retain the attribution and the notice that the weights were modified (quantized).
