Qwen3-8B W4A16-AWQ (int4)
A 4-bit (W4A16-AWQ) quantization of Qwen/Qwen3-8B, produced for the BareMetalRT consumer inference engine.
This is a modified, quantized derivative of Qwen3-8B. The weights have been AWQ 4-bit quantized; everything else (architecture, tokenizer, chat template) is unchanged from the base model.
What this is for
It delivers 8B-class quality on an 8 GB consumer GPU (the bf16 Qwen3-8B needs
~16 GB). It loads directly on the TensorRT-LLM PyTorch (_torch) backend, which
reads hf_quant_config.json and dispatches the W4A16-AWQ linear kernels, so no
engine build step is required.
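The memory figures above can be sanity-checked with a back-of-envelope estimate. The sketch below is only an approximation under stated assumptions: a nominal 8e9 parameter count, one 16-bit scale per group of 128 weights, and every weight quantized. It ignores the higher-precision lm_head, the pre-quant scales, the KV cache, and activations, so real usage will be somewhat higher.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9    # assumption: nominal "8B" parameter count
GROUP_SIZE = 128  # group size stated in the Quantization section

bf16_bits = 16.0
# Assumption: each 4-bit weight shares one 16-bit scale with its group of 128.
int4_bits = 4.0 + 16.0 / GROUP_SIZE

print(f"bf16 weights: {weight_gb(N_PARAMS, bf16_bits):.1f} GB")
print(f"int4 weights: {weight_gb(N_PARAMS, int4_bits):.1f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the bf16 footprint, which leaves headroom on an 8 GB card for the KV cache and runtime buffers.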
Quantization
- Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with per-input pre-quant scales. lm_head kept in higher precision.
- Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37), INT4_AWQ_CFG, AWQ scale search calibrated on cnn_dailymail, exported via export_hf_checkpoint.
- Format: modelopt unified-HF checkpoint (hf_quant_config.json + quant_algo: W4A16_AWQ).
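To illustrate what "4-bit weights, group size 128" means, here is a minimal pure-Python sketch of group-wise weight quantization. It is not the modelopt implementation and it does not perform the AWQ scale search: it uses plain symmetric round-to-nearest, which is an assumption made for illustration, and the function names are hypothetical. AWQ additionally rescales input channels using calibration data before a step like this one.

```python
import random

GROUP_SIZE = 128  # matches the group size stated above


def quantize_group(weights):
    """Symmetric round-to-nearest int4 quantization of one group (illustrative)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest weight to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate floating-point weights from int4 codes and a scale."""
    return [v * scale for v in q]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split one weight row into groups, each with its own scale."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weights, not real data
packed = quantize_row(row)
restored = [w for q, s in packed for w in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(packed)}, max abs error: {max_err:.5f}")
```

Because each group carries its own scale, one outlier weight only degrades the resolution of the 128 weights in its own group rather than the whole row.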
Hardware notes
- Ada (RTX 40-series, sm_89) and newer: runs with the kernel autotuner on.
- Ampere (RTX 30-series, sm_86): runs with the autotuner disabled (the fpA_intB tactic search does not converge on Ampere). BareMetalRT handles this automatically per-GPU.
- Validated end-to-end (coherence battery) on an RTX 4070 SUPER.
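The per-GPU rule in the notes above can be expressed as a small check on the CUDA compute capability. This is a hypothetical helper written for illustration, not BareMetalRT's actual code. Treating every GPU older than sm_89 the same way as sm_86 is an assumption; the notes above only cover sm_86 and sm_89 or newer.

```python
def autotuner_enabled(major: int, minor: int) -> bool:
    """Hypothetical helper: enable the kernel autotuner on sm_89 and newer.

    Assumption: anything below sm_89 is handled like sm_86 (autotuner off).
    """
    return (major, minor) >= (8, 9)


# Example compute capabilities as (major, minor) tuples.
for name, cap in [("RTX 40-series (sm_89)", (8, 9)),
                  ("RTX 30-series (sm_86)", (8, 6))]:
    state = "on" if autotuner_enabled(*cap) else "off"
    print(f"{name}: autotuner {state}")
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that could be passed to such a check.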
License
Inherits Apache 2.0 from the base model. See LICENSE and NOTICE. You must
retain the attribution and the notice that the weights were modified (quantized).