Mistral-7B-Instruct-v0.3 W4A16-AWQ (int4)

A 4-bit (W4A16-AWQ) quantization of mistralai/Mistral-7B-Instruct-v0.3, produced for the BareMetalRT consumer inference engine.

This is a modified, quantized derivative. The weights have been AWQ 4-bit quantized; everything else (architecture, tokenizer, chat template) is unchanged from the base model.

What this is for

A genuine 7B-class model that fits on an 8 GB consumer GPU in 4-bit (the bf16 Mistral-7B needs ~15 GB). Mistral's small 32K vocabulary keeps the embedding / lm_head overhead small, so unlike larger-vocab 8B models this fits comfortably in 8 GB. It loads on the TensorRT-LLM PyTorch (_torch) backend, which reads hf_quant_config.json and dispatches the W4A16-AWQ kernels, so no engine build is required.
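The memory figures above can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the total parameter count (about 7.25B), the hidden size (4096), and the choice to keep both the embedding and lm_head at 16 bits are assumptions, not values read from this checkpoint. It also ignores the KV cache, activations, and the small per-input pre-quant scale tensors.

```python
# Rough weight-memory estimate; every constant here is an assumption for illustration.
HIDDEN = 4096            # assumed hidden size
VOCAB = 32_768           # the "32K" vocabulary
TOTAL_PARAMS = 7.25e9    # assumed approximate parameter count
GROUP_SIZE = 128         # group size stated in the Quantization section
GB = 1e9


def weight_memory_gb(total_params, vocab, hidden, group_size):
    """Return (bf16_gb, w4a16_gb) for the weights alone."""
    embed = vocab * hidden
    lm_head = vocab * hidden
    body = total_params - embed - lm_head

    # bf16: 2 bytes per parameter everywhere.
    bf16_gb = total_params * 2 / GB

    # 4-bit body: 0.5 byte per weight plus one 16-bit scale per group.
    body_int4_gb = body * (0.5 + 2 / group_size) / GB
    # Embedding and lm_head assumed to stay at 16 bits.
    kept_16bit_gb = (embed + lm_head) * 2 / GB
    return bf16_gb, body_int4_gb + kept_16bit_gb


bf16_gb, w4_gb = weight_memory_gb(TOTAL_PARAMS, VOCAB, HIDDEN, GROUP_SIZE)
print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"W4A16 weights: {w4_gb:.1f} GB")
```

Under these assumptions the bf16 weights come to roughly 14.5 GB and the 4-bit weights to a little over 4 GB, which leaves headroom on an 8 GB card for the KV cache and runtime buffers.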

Quantization

  • Method: W4A16-AWQ (4-bit weights, 16-bit activations), group size 128, with per-input pre-quant scales. lm_head kept in higher precision.
  • Tooling: NVIDIA TensorRT Model Optimizer (modelopt 0.37) INT4_AWQ_CFG, AWQ scale search calibrated on cnn_dailymail, exported via export_hf_checkpoint.
  • Format: modelopt unified-HF checkpoint (quant_algo: W4A16_AWQ).
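To make "4-bit weights, group size 128" concrete, here is a minimal sketch of group-wise weight quantization in plain Python. It is not the modelopt implementation: it uses simple symmetric round-to-nearest with one scale per group of 128 weights, and it omits the AWQ scale search and the pre-quant scales entirely. The function names are hypothetical.

```python
GROUP_SIZE = 128
INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit range


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row into int4 codes plus one scale per group."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto INT4_MAX.
        scale = amax / INT4_MAX if amax > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(INT4_MIN, min(INT4_MAX, q)))
    return codes, scales


def dequantize_row(codes, scales, group_size=GROUP_SIZE):
    """Reconstruct approximate 16-bit-style weights from codes and scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy row of 256 weights, so two groups of 128.
row = [((i * 37) % 101 - 50) / 500 for i in range(256)]
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), max_err)
```

Each group stores 128 four-bit codes and a single scale, so the reconstruction error per weight is bounded by half of that group's scale. AWQ improves on this baseline by choosing per-channel scales from calibration activations before quantizing.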

Hardware notes

  • Ada (RTX 40-series, sm_89) and newer: autotuner on. Ampere (RTX 30-series, sm_86): autotuner disabled. BareMetalRT applies this setting automatically per GPU.
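The per-GPU rule above amounts to a threshold on compute capability. The sketch below shows one way such a check could look; it is a hypothetical helper, not BareMetalRT's actual code, and treating every architecture older than sm_89 the same way as sm_86 is an assumption.

```python
def autotuner_enabled(compute_capability):
    """Return True for sm_89 and newer, False for older parts such as sm_86.

    `compute_capability` is a (major, minor) tuple, e.g. (8, 9) for Ada.
    """
    major, minor = compute_capability
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= (8, 9)


for cc in [(8, 6), (8, 9), (12, 0)]:
    print(cc, autotuner_enabled(cc))
```

With PyTorch installed, the tuple can be obtained from torch.cuda.get_device_capability().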

License

Inherits Apache 2.0 from the base model. See LICENSE and NOTICE. Retain the attribution and the notice that the weights were modified (quantized).
