Mistral-7B-Instruct-v0.3 W4A16-AWQ (int4)
A 4-bit (W4A16-AWQ) quantization of mistralai/Mistral-7B-Instruct-v0.3, produced for the BareMetalRT consumer inference engine.
This is a modified, quantized derivative. The weights have been AWQ 4-bit quantized; everything else (architecture, tokenizer, chat template) is unchanged from the base model.
What this is for
A genuine 7B-class model that fits an 8 GB consumer GPU in 4-bit (the bf16
Mistral-7B needs ~15 GB). Mistral's small 32K vocabulary keeps the embedding /
lm_head overhead tiny, so unlike larger-vocab 8B models this lands comfortably in
8 GB. It loads on the TensorRT-LLM PyTorch (_torch) backend, which reads
hf_quant_config.json and dispatches the W4A16-AWQ kernels, so no engine build is required.
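The memory claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weights only. The shape constants are assumptions based on the published Mistral-7B-v0.3 configuration and should be verified against the checkpoint's config.json. It also assumes the embedding table stays at 16 bits alongside lm_head, and it ignores norms, pre-quant scales, the KV cache, and runtime overhead.

```python
# Assumed shape values for Mistral-7B-v0.3; verify against config.json.
VOCAB = 32768
HIDDEN = 4096
LAYERS = 32
INTERMEDIATE = 14336
KV_DIM = 1024        # 8 KV heads x 128 head dim (grouped-query attention)
GROUP_SIZE = 128     # AWQ group size stated in this card


def linear_params_per_layer() -> int:
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE                   # gate, up, down projections
    return attn + mlp


def weight_bytes_bf16() -> float:
    total = LAYERS * linear_params_per_layer() + 2 * VOCAB * HIDDEN
    return total * 2  # 2 bytes per bf16 weight


def weight_bytes_w4a16() -> float:
    quantized = LAYERS * linear_params_per_layer()
    bits_per_weight = 4 + 16 / GROUP_SIZE    # 4-bit code + one 16-bit scale per group
    high_precision = 2 * VOCAB * HIDDEN * 2  # assumption: embedding and lm_head at 16 bits
    return quantized * bits_per_weight / 8 + high_precision


bf16_gb = weight_bytes_bf16() / 1e9
w4_gb = weight_bytes_w4a16() / 1e9
print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"W4A16 weights: ~{w4_gb:.1f} GB")
```

Under these assumptions the quantized weights take a little over 4 GB, which leaves headroom on an 8 GB card for the KV cache and activations.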
Quantization
- Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with
per-input pre-quant scales. lm_head kept in higher precision.
- Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37) INT4_AWQ_CFG, AWQ scale search calibrated on cnn_dailymail, exported via export_hf_checkpoint.
- Format: modelopt unified-HF checkpoint (quant_algo: W4A16_AWQ).
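To illustrate what "group size 128" means, here is a minimal pure-Python sketch of group-wise 4-bit weight quantization. It is not modelopt's implementation: it assumes symmetric signed codes in [-8, 7] with one scale per group, and it leaves out the AWQ scale search and the per-input pre-quant scales entirely. The function names are hypothetical.

```python
import random

GROUP_SIZE = 128  # group size stated in this card
QMAX = 7          # assumption: symmetric signed 4-bit codes, clipped to [-8, 7]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row group by group; returns (codes, scales)."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # One scale per group; fall back to 1.0 if the group is all zeros.
        scale = max(abs(w) for w in group) / QMAX or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_row(codes, scales, group_size=GROUP_SIZE):
    """Map each 4-bit code back to a float using its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(scales)} groups, max abs error {max_err:.6f}")
```

Because each group carries its own scale, the rounding error for any weight is bounded by half of that group's scale rather than by the range of the whole row.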
Hardware notes
- Ada and newer (RTX 40-series, sm_89+): autotuner on.
- Ampere (RTX 30-series, sm_86): autotuner disabled automatically (BareMetalRT handles this per GPU).
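The per-GPU rule above can be expressed as a compute-capability check. This is only a sketch: `autotuner_enabled` is a hypothetical helper, not part of BareMetalRT, and treating every architecture below sm_89 the same way as sm_86 is an assumption, since only those two cases are stated here.

```python
def autotuner_enabled(major: int, minor: int) -> bool:
    """Hypothetical helper: True for sm_89 and newer, False otherwise."""
    return (major, minor) >= (8, 9)


print(autotuner_enabled(8, 9))  # Ada, RTX 40-series
print(autotuner_enabled(8, 6))  # Ampere, RTX 30-series
```

The capability tuple could come from, for example, `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the selected device.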
License
Inherits Apache 2.0 from the base model. See LICENSE and NOTICE. Retain the
attribution and the notice that the weights were modified (quantized).