Mistral-7B-Instruct-v0.3 W4A16-AWQ (int4)

A 4-bit (W4A16-AWQ) quantization of mistralai/Mistral-7B-Instruct-v0.3, produced for the BareMetalRT consumer inference engine.

This is a modified, quantized derivative. The weights have been AWQ 4-bit quantized; everything else (architecture, tokenizer, chat template) is unchanged from the base model.

What this is for

A genuine 7B-class model that fits on an 8 GB consumer GPU in 4-bit (the bf16 Mistral-7B needs ~15 GB for weights alone). Mistral's small 32K vocabulary keeps the embedding / lm_head overhead small, so unlike larger-vocabulary 8B models this one fits comfortably in 8 GB. It loads on the TensorRT-LLM PyTorch (_torch) backend, which reads hf_quant_config.json and dispatches the W4A16-AWQ kernels, so no engine build is needed.
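The memory claim above can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate only: the layer shapes are assumed from the public base-model config (hidden size 4096, 32 layers, MLP width 14336, 8 KV heads of dimension 128, vocabulary 32768), the function names are illustrative, and it counts weights only (no KV cache, activations, or runtime overhead). It also assumes one 16-bit scale per group of 128 weights and ignores the small pre-quant scale vectors.

```python
# Rough weight-memory estimate for Mistral-7B-Instruct-v0.3.
# Shape values are assumptions taken from the public base-model config.
HIDDEN = 4096
LAYERS = 32
INTERMEDIATE = 14336
KV_DIM = 8 * 128      # 8 KV heads x head_dim 128 (grouped-query attention)
VOCAB = 32768
GROUP_SIZE = 128      # AWQ group size stated in this card


def layer_linear_params() -> int:
    """Parameters in the linear layers of one transformer block."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * HIDDEN * INTERMEDIATE                   # gate_proj, up_proj, down_proj
    return attn + mlp


def total_params() -> int:
    """All parameters: block linears, norms, embedding, and lm_head."""
    norms = LAYERS * 2 * HIDDEN + HIDDEN              # two norms per block + final norm
    embeddings = 2 * VOCAB * HIDDEN                   # embed_tokens + untied lm_head
    return LAYERS * layer_linear_params() + norms + embeddings


def bf16_gb() -> float:
    """Weight footprint with every parameter stored in 16 bits."""
    return total_params() * 2 / 1e9


def w4a16_gb() -> float:
    """Weight footprint with block linears in 4 bits, the rest in 16 bits."""
    linear = LAYERS * layer_linear_params()
    packed = linear * 0.5                             # 4 bits per weight
    scales = linear / GROUP_SIZE * 2                  # one 16-bit scale per group
    unquantized = (total_params() - linear) * 2       # embedding, lm_head, norms
    return (packed + scales + unquantized) / 1e9


print(f"parameters:    {total_params():,}")
print(f"bf16 weights:  {bf16_gb():.1f} GB")
print(f"W4A16 weights: {w4a16_gb():.1f} GB")
```

Under these assumptions the estimate comes out at roughly 14.5 GB for bf16 and roughly 4.1 GB for the 4-bit checkpoint, which leaves headroom on an 8 GB card for the KV cache and activations.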

Quantization

Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with per-input-channel pre-quant scales. The lm_head is kept in higher precision.
Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37) with INT4_AWQ_CFG; AWQ scale search calibrated on cnn_dailymail; exported via export_hf_checkpoint.
Format: modelopt unified-HF checkpoint (quant_algo: W4A16_AWQ).
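To make the method line concrete, here is a minimal pure-Python sketch of group-wise 4-bit weight quantization with a per-input-channel pre-quant scale. It is an illustration of the general technique, not modelopt's implementation: the function names are hypothetical, the quantizer is assumed to be symmetric (no zero point), and the pre-quant scales are supplied by the caller, whereas real AWQ searches for them on calibration data.

```python
GROUP_SIZE = 128
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_row(weights, group_size=GROUP_SIZE):
    """Quantize one weight row group by group.

    Returns (codes, scales): one int4 code per weight, one scale per group.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / QMAX
        if scale == 0.0:
            scale = 1.0  # all-zero group: any scale reproduces it
        scales.append(scale)
        codes.extend(max(QMIN, min(QMAX, round(w / scale))) for w in group)
    return codes, scales


def dequantize_row(codes, scales, group_size=GROUP_SIZE):
    """Map int4 codes back to floats using the scale of each code's group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def awq_dot(x, codes, scales, pre_quant_scale):
    """Dot product of activations with a quantized row.

    Each activation is multiplied by its per-input-channel pre-quant scale
    before meeting the dequantized weight, which was divided by the same
    scale before quantization.
    """
    w = dequantize_row(codes, scales)
    return sum(xi * pi * wi for xi, pi, wi in zip(x, pre_quant_scale, w))
```

A usage sketch: divide each weight by its channel's pre-quant scale, call `quantize_row` on the result, and pass the same scales to `awq_dot` at inference time. With this symmetric scheme each dequantized weight differs from its input by at most half of its group's scale.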

Hardware notes

Ada (RTX 40-series, sm_89) and newer: autotuner enabled. Ampere (RTX 30-series, sm_86): autotuner disabled automatically (BareMetalRT handles this per GPU).
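The per-GPU rule above amounts to a compute-capability threshold. The sketch below is a hypothetical illustration, not BareMetalRT's actual code: the function name is invented, and treating every GPU older than sm_89 as "autotuner off" is an assumption, since the card only mentions sm_86 and sm_89+.

```python
def autotuner_enabled(compute_capability: tuple[int, int]) -> bool:
    """Return True for sm_89 (Ada) and newer, False for anything older.

    compute_capability is a (major, minor) pair, for example the value
    returned by torch.cuda.get_device_capability().
    """
    # Tuples compare element by element, so (8, 6) < (8, 9) < (9, 0).
    return compute_capability >= (8, 9)


for name, cc in [("RTX 30-series", (8, 6)), ("RTX 40-series", (8, 9))]:
    state = "on" if autotuner_enabled(cc) else "off"
    print(f"{name} sm_{cc[0]}{cc[1]}: autotuner {state}")
```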

License

Inherits Apache 2.0 from the base model. See LICENSE and NOTICE. Retain the attribution and the notice that the weights were modified (quantized).
