
Llama 3 8B Instruct compressed in one shot to 50% sparsity, with INT8 weights and activations, using SparseGPT, SmoothQuant, and GPTQ.

Made with SparseML and DeepSparse 1.7. Install the dependencies with `pip install deepsparse~=1.7 "sparseml[transformers]"~=1.7 "numpy<2"`.
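
For deployment, the model is intended to run with DeepSparse's text-generation pipeline. The snippet below is a minimal sketch assuming DeepSparse 1.7's `TextGeneration` pipeline and its `hf:` Hugging Face Hub model stubs; the prompt and `max_new_tokens` value are illustrative.

```python
from deepsparse import TextGeneration

# Load the compressed model from the Hugging Face Hub through DeepSparse.
pipeline = TextGeneration(model="hf:mgoin/Meta-Llama-3-8B-Instruct-pruned50-quant-ds")

# Run a short generation as a smoke test.
output = pipeline(prompt="What is sparsity?", max_new_tokens=64)
print(output.generations[0].text)
```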

Here is the script used for SparseML compression:

```python
from datasets import load_dataset
from sparseml.transformers import (
    SparseAutoModelForCausalLM,
    SparseAutoTokenizer,
    compress,
)

model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto"
)
tokenizer = SparseAutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
dataset = load_dataset("garage-bAInd/Open-Platypus")


# Format each calibration sample with the Llama 3 chat template so the
# calibration data matches the prompts the model sees at inference time.
def format_data(data):
    instruction = tokenizer.apply_chat_template(
        [{"role": "user", "content": data["instruction"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return {"text": instruction + data["output"]}


dataset = dataset.map(format_data)

recipe = """
compression_stage:
    run_type: oneshot
    oneshot_modifiers:
        QuantizationModifier:
            ignore:
                # These operations don't make sense to quantize
                - LlamaRotaryEmbedding
                - LlamaRMSNorm
                - SiLUActivation
                - QuantizableMatMul
                # Skip quantizing the layers with the most sensitive activations
                - model.layers.1.mlp.down_proj
                - model.layers.31.mlp.down_proj
                - model.layers.14.self_attn.q_proj
                - model.layers.14.self_attn.k_proj
                - model.layers.14.self_attn.v_proj
            post_oneshot_calibration: true
            scheme_overrides:
                # Enable channelwise quantization for better accuracy
                Linear:
                    weights:
                        num_bits: 8
                        symmetric: true
                        strategy: channel
                # For the embeddings, only weight-quantization makes sense
                Embedding:
                    input_activations: null
                    weights:
                        num_bits: 8
                        symmetric: false
        SparseGPTModifier:
            sparsity: 0.5
            quantize: True
            targets: ['re:model.layers.\\d*$']
"""

# Apply the recipe in one shot (no training loop) and save the compressed checkpoint.
compress(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    recipe=recipe,
    output_dir="./one-shot-checkpoint",
)
```
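
Once compression finishes, the checkpoint in `./one-shot-checkpoint` can be reloaded for a quick sanity check. This is a minimal sketch, assuming the one-shot checkpoint reloads through the same `SparseAuto*` classes used above; the prompt and generation settings are illustrative.

```python
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

# Reload the compressed checkpoint produced by the script above.
model = SparseAutoModelForCausalLM.from_pretrained(
    "./one-shot-checkpoint", device_map="auto"
)
tokenizer = SparseAutoTokenizer.from_pretrained("./one-shot-checkpoint")

# Build a chat-formatted prompt and generate a short completion.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is one-shot pruning?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```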