Compressed LLMs for nm-vllm (collection, 18 items)

LLMs compressed using SparseGPT and GPTQ for optimized inference with nm-vllm: https://github.com/neuralmagic/nm-vllm
| Benchmark | Mistral-7B-Instruct-v0.3 | Mistral-7B-Instruct-v0.3-GPTQ-4bit (this model) |
|---|---|---|
| arc-c 25-shot | 63.48 | 63.40 |
| mmlu 5-shot | 61.13 | 60.89 |
| hellaswag 10-shot | 84.49 | 84.04 |
| winogrande 5-shot | 79.16 | 79.08 |
| gsm8k 5-shot | 43.37 | 45.41 |
| truthfulqa 0-shot | 59.65 | 57.48 |
| Average Accuracy | 65.21 | 65.05 |
| Recovery | 100% | 99.75% |
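Recovery is the quantized model's average accuracy relative to the unquantized baseline's. A quick check of the figures in the table above:

```python
# Benchmark scores from the table: baseline Mistral-7B-Instruct-v0.3
# vs. the GPTQ-4bit quantized model, in the same row order
# (arc-c, mmlu, hellaswag, winogrande, gsm8k, truthfulqa).
baseline = [63.48, 61.13, 84.49, 79.16, 43.37, 59.65]
quantized = [63.40, 60.89, 84.04, 79.08, 45.41, 57.48]

avg_base = sum(baseline) / len(baseline)
avg_quant = sum(quantized) / len(quantized)
recovery = 100 * avg_quant / avg_base

print(f"baseline avg: {avg_base:.2f}")   # 65.21
print(f"quantized avg: {avg_quant:.2f}") # 65.05
print(f"recovery: {recovery:.2f}%")      # 99.75%
```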
This model is ready for optimized inference using the Marlin mixed-precision kernels in vLLM (https://github.com/vllm-project/vllm). Start it as an OpenAI-compatible inference server with:
```shell
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit
```
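The server exposes an OpenAI-compatible HTTP API. Below is a minimal sketch of a chat-completions request against it, assuming vLLM's default address of http://localhost:8000; the prompt text is only a placeholder, and the endpoint path follows the OpenAI chat API convention.

```python
import json
import urllib.request

# OpenAI-compatible chat-completions payload for the served model.
payload = {
    "model": "neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit",
    "messages": [
        {"role": "user", "content": "Summarize GPTQ quantization in one sentence."},
    ],
    "max_tokens": 128,
}

# Host and port are vLLM server defaults (assumed); adjust to your deployment.
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (for example, the `openai` Python package pointed at this base URL) can be used in place of raw HTTP.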