Model Card for Mistral-7B-Instruct-v0.3 quantized to 4bit weights

  • Weight-only quantization of Mistral-7B-Instruct-v0.3 via GPTQ to 4bits with group_size=128
  • GPTQ optimized for 99.75% accuracy recovery relative to the unquantized model

Open LLM Leaderboard evaluation scores

Mistral-7B-Instruct-v0.3 Mistral-7B-Instruct-v0.3-GPTQ-4bit
(this model)
arc-c
25-shot
63.48 63.40
mmlu
5-shot
61.13 60.89
hellaswag
10-shot
84.49 84.04
winogrande
5-shot
79.16 79.08
gsm8k
5-shot
43.37 45.41
truthfulqa
0-shot
59.65 57.48
Average
Accuracy
65.21 65.05
Recovery 100% 99.75%

vLLM Inference Performance

This model is ready for optimized inference using the Marlin mixed-precision kernels in vLLM: https://github.com/vllm-project/vllm

Simply start this model as an inference server with:

python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit

image/png

Downloads last month
1,573
Safetensors
Model size
1.21B params
Tensor type
I32
·
FP16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit

Quantized
(122)
this model

Evaluation results