
SparseLlama-3-8B-pruned_50.2of4-FP8

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model pruned in one-shot with SparseGPT and then retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask. The model was subsequently quantized with AutoFP8 to FP8 weights and activations with per-tensor scales, calibrated on UltraChat2k.

Note: The unquantized SparseLlama-3-8B-pruned_50.2of4 is still a work in progress and subject to change. This FP8 model will be updated once the unquantized model is updated.
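
A minimal sketch of how an FP8-quantized checkpoint like this one is typically served with vLLM, which supports FP8 weight and activation quantization, is shown below. The repository id is a placeholder (assumption) and should be replaced with this model's actual Hugging Face path.

```python
# Minimal inference sketch, assuming vLLM with FP8 support on a compatible GPU.
# The repo id below is a placeholder for this model's Hugging Face path.
from vllm import LLM, SamplingParams

model_id = "<org>/SparseLlama-3-8B-pruned_50.2of4-FP8"  # placeholder repo id

llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128)

prompts = ["Explain 2:4 structured sparsity in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```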

Evaluation Benchmark Results

Model evaluation results were obtained via lm-evaluation-harness, following the configuration of the Open LLM Leaderboard; a reproduction sketch follows the results table below.

| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 | SparseLlama-3-8B-pruned_50.2of4-FP8 (this model) |
|---|---|---|---|
| ARC-c (25-shot) | 59.47% | 57.76% | 58.02% |
| MMLU (5-shot) | 65.29% | 60.44% | 60.71% |
| HellaSwag (10-shot) | 82.14% | 79.97% | 79.61% |
| WinoGrande (5-shot) | 77.27% | 77.19% | 76.32% |
| GSM8K (5-shot) | 44.81% | 47.92% | 49.36% |
| TruthfulQA (0-shot) | 43.96% | 41.02% | 40.82% |
| Average Accuracy | 62.16% | 60.72% | 60.81% |
| Recovery | 100% | 97.68% | 97.83% |
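
For reference, the sketch below shows how a single benchmark can be reproduced with the lm-evaluation-harness Python API. The placeholder model path, the vLLM backend choice, and the batch size are assumptions; the exact Open LLM Leaderboard configuration (prompt formatting, batching) may differ.

```python
# Sketch of reproducing ARC-Challenge (25-shot) with lm-evaluation-harness >= 0.4.
# The model path is a placeholder; the vLLM backend is assumed here because the
# checkpoint stores FP8 weights and activation scales.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<org>/SparseLlama-3-8B-pruned_50.2of4-FP8,dtype=auto",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```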

Help

For further support, and for discussions on these models and AI in general, join Neural Magic's Slack Community.
