MaskLLM: Learnable Semi-structured Sparsity for Large Language Models
This work introduces MaskLLM, a learnable pruning method that establishes semi-structured ("N:M") sparsity in LLMs to reduce computational overhead during inference. The method is scalable and benefits from larger training datasets.
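For reference, a 2:4 (N:M) pattern keeps 2 non-zero weights in every contiguous group of 4. The snippet below is only an illustration of the pattern itself; it selects weights by magnitude, whereas MaskLLM learns the mask end-to-end.

```python
# Illustration of a 2:4 semi-structured pattern: in each group of 4
# consecutive weights, only 2 are kept (chosen by magnitude here purely
# for illustration; MaskLLM *learns* the mask instead).
import numpy as np

weights = np.random.randn(8)                       # two groups of 4
groups = weights.reshape(-1, 4)
mask = np.zeros_like(groups)
top2 = np.argsort(-np.abs(groups), axis=1)[:, :2]  # indices of the 2 largest per group
np.put_along_axis(mask, top2, 1.0, axis=1)
print(mask)                                        # exactly two 1s per row
print(groups * mask)                               # 2:4-sparse weights
```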
Requirements
We provide pre-computed masks for Hugging Face models such as LLaMA-2 7B and LLaMA-3 8B. Using them requires only minimal dependencies and does not involve Docker, Megatron, or data preprocessing.
pip install transformers accelerate datasets SentencePiece
Pre-computed Masks
The following masks were trained and provided by @VainF. We use huggingface_hub
to automatically download these masks and apply them to the official LLMs for evaluation (a loading sketch follows the table below). The mask files were compressed with numpy.savez_compressed. Additional baseline results (SparseGPT, Wanda) can be found in the appendix.
| Model | Pattern | Training Data | Training/Eval SeqLen | PPL (Dense) | PPL (SparseGPT) | PPL (MaskLLM) | Link |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2:4 | C4 (2B Tokens) | 4096 | 5.12 | 10.42 | 6.78 | HuggingFace |
| LLaMA-3 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.75 | 17.64 | 8.49 | HuggingFace |
| LLaMA-3.1 8B | 2:4 | C4 (2B Tokens) | 4096 | 5.89 | 18.65 | 8.58 | HuggingFace |
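To fetch and inspect one of the compressed mask archives locally, a minimal sketch along the following lines should work. The repository ID comes from the table above, but the file name and the per-tensor key layout inside the archive are assumptions for illustration; the exact layout is defined by the MaskLLM release.

```python
# Sketch: download and inspect a pre-computed mask archive from the Hub.
# The file name "mask_compressed.npz" and the key layout are assumptions.
import numpy as np
from huggingface_hub import hf_hub_download

mask_path = hf_hub_download(
    repo_id="Vinnnf/LLaMA-3.1-8B-MaskLLM-C4",
    filename="mask_compressed.npz",  # assumed file name
)

# Files written with numpy.savez_compressed are read back with np.load.
masks = np.load(mask_path)
for name in list(masks.files)[:5]:
    m = masks[name]
    # Assuming boolean/binary masks, the mean is the kept-weight density (~0.5 for 2:4).
    print(name, m.shape, m.dtype, f"density={m.mean():.2f}")
```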
How to use it
Please see NVlabs/MaskLLM.
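For a quick experiment outside the official pipeline, a hedged sketch of applying the downloaded 2:4 masks to the corresponding Hugging Face checkpoint could look like the following. It assumes the archive keys match the model's state_dict parameter names; the scripts in NVlabs/MaskLLM handle mask application and evaluation properly and should be preferred.

```python
# Sketch: zero out weights of an official checkpoint using pre-computed masks.
# Assumes mask keys align with the model's state_dict names; prefer the
# official application/evaluation scripts in NVlabs/MaskLLM.
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)

mask_path = hf_hub_download(
    repo_id="Vinnnf/LLaMA-3.1-8B-MaskLLM-C4",
    filename="mask_compressed.npz",  # assumed file name
)
masks = np.load(mask_path)

state_dict = model.state_dict()
with torch.no_grad():
    for name in masks.files:
        if name in state_dict:
            # Multiply the dense weight by its learned 2:4 mask (0/1 entries).
            mask = torch.from_numpy(np.asarray(masks[name], dtype=np.float16))
            state_dict[name].mul_(mask.to(state_dict[name].device))
```

Note that the perplexities in the table above were produced with the official evaluation code, so numbers obtained from this sketch may differ.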