Qwen3-8B W4A16-AWQ (int4)

A 4-bit (W4A16-AWQ) quantization of Qwen/Qwen3-8B, produced for the BareMetalRT consumer inference engine.

This is a modified, quantized derivative of Qwen3-8B. The weights have been quantized to 4 bits with AWQ; everything else (architecture, tokenizer, chat template) is unchanged from the base model.

What this is for

It delivers 8B-class quality on an 8 GB consumer GPU (the bf16 Qwen3-8B needs ~16 GB). It loads directly on the TensorRT-LLM PyTorch (_torch) backend, which reads hf_quant_config.json and dispatches the W4A16-AWQ linear kernels, so no engine build step is needed.
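
The memory figures above follow from simple arithmetic on bits per weight. The sketch below is a rough estimate only. It assumes a round 8 billion parameters and one 16-bit scale per group of 128 weights, and it ignores the higher-precision lm_head, the KV cache, and activations. The function names are illustrative and not part of any library.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Bits stored per weight when each group carries one shared scale.

    scale_bits=16 is an assumption for this estimate.
    """
    return weight_bits + scale_bits / group_size


def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8_000_000_000  # assumed round figure for an 8B-class model

bf16_gb = weight_gb(N_PARAMS, 16)
w4_gb = weight_gb(N_PARAMS, effective_bits_per_weight(4, 128))

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"W4 g128 weights: ~{w4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the bf16 footprint, which is what leaves room on an 8 GB card for the KV cache and activations.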

Quantization

  • Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with per-input-channel pre-quant scales. The lm_head is kept in higher precision.
  • Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37) with the INT4_AWQ_CFG config. The AWQ scale search was calibrated on cnn_dailymail, and the checkpoint was exported via export_hf_checkpoint.
  • Format: modelopt unified-HF checkpoint (hf_quant_config.json with quant_algo: W4A16_AWQ).
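
To make the method above concrete, here is a minimal pure-Python sketch of group-wise 4-bit weight quantization with a pre-quant scale. It is not modelopt's implementation: it assumes symmetric quantization to the range [-8, 7] with one scale per group, takes the pre-quant scale as given (the AWQ scale search is not shown), and every function name is hypothetical.

```python
def quantize_group(values):
    """Symmetric 4-bit quantization of one group: returns (codes, scale)."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row, pre_quant_scale, group_size=128):
    """Multiply each weight by its input-channel scale, then quantize per group."""
    scaled = [w * s for w, s in zip(row, pre_quant_scale)]
    return [
        quantize_group(scaled[i:i + group_size])
        for i in range(0, len(scaled), group_size)
    ]


def dequantize_row(groups):
    """Rebuild the (scaled) weights from 4-bit codes and per-group scales."""
    return [c * scale for codes, scale in groups for c in codes]


def awq_dot(x, groups, pre_quant_scale):
    """One output of a linear layer: activations are divided by the same
    scale the weights were multiplied by, so the product is preserved."""
    x_scaled = [xi / s for xi, s in zip(x, pre_quant_scale)]
    return sum(xi * wi for xi, wi in zip(x_scaled, dequantize_row(groups)))
```

The weights are stored as 4-bit codes plus one scale per 128 values, while the activations stay in 16-bit and absorb the inverse of the pre-quant scale at run time.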

Hardware notes

  • Ada (RTX 40-series, sm_89) and newer: runs with the kernel autotuner on.
  • Ampere (RTX 30-series, sm_86): runs with the autotuner disabled (the fpA_intB tactic search does not converge on Ampere). BareMetalRT handles this automatically per-GPU.
  • Validated end-to-end (coherence battery) on an RTX 4070 SUPER.
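
The per-GPU rule described above can be sketched as a small lookup on CUDA compute capability. This is an illustration of the rule as stated in this card, not BareMetalRT's actual code. The function name is hypothetical, and architectures the card does not mention are treated as unspecified rather than guessed.

```python
def autotuner_enabled(major, minor):
    """Return whether the kernel autotuner should be on for a given
    compute capability, following the hardware notes in this card."""
    capability = (major, minor)
    if capability >= (8, 9):
        # Ada (sm_89) and newer: autotuner on.
        return True
    if capability == (8, 6):
        # Ampere RTX 30-series (sm_86): autotuner off.
        return False
    # The card makes no statement about other architectures.
    raise ValueError(f"no guidance for sm_{major}{minor}")
```

With PyTorch installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that could be passed straight to this helper.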

License

This model inherits the Apache 2.0 license from the base model. See LICENSE and NOTICE. You must retain the attribution and the notice that the weights were modified (quantized).
