
llama2.c-stories110M-pruned50

This repo contains model files for llama2.c 110M tinystories optimized for NM-vLLM, a high-throughput serving engine for compressed LLMs.

This model was pruned with SparseGPT, using SparseML.
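
The `pruned50` suffix in the model name and the `sparsity: 0.5` setting in the recipe further down indicate a 50% sparsity target, meaning roughly half of the weights in the pruned layers are zero. The sketch below illustrates that idea on a toy list of weights. It uses plain magnitude pruning as a stand-in, not SparseGPT (which relies on calibration data and also adjusts the remaining weights), and the helper names `prune_by_magnitude` and `sparsity` are hypothetical.

```python
def prune_by_magnitude(weights, target_sparsity):
    """Zero out the smallest-magnitude entries (a simple stand-in, not SparseGPT)."""
    n_prune = int(len(weights) * target_sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


# Toy values chosen only for illustration; they are not taken from the model.
toy_weights = [0.8, -0.05, 0.3, -0.6, 0.01, 0.2, -0.9, 0.1]
pruned = prune_by_magnitude(toy_weights, target_sparsity=0.5)
print(pruned)
print(sparsity(pruned))
```

The same `sparsity` check could in principle be applied to a flattened weight tensor from the checkpoint to confirm that the pruned layers are close to half zeros.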

Inference

Install NM-vLLM for fast inference and low memory usage:

pip install "nm-vllm[sparse]"

Run in a Python pipeline for local inference:

from vllm import LLM, SamplingParams

model = LLM("nm-testing/llama2.c-stories110M-pruned50", sparsity="sparse_w16a16")
prompt = "Hello my name is"

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Prompt template

N/A

Sparsification

For details on how this model was sparsified, see the recipe.yaml in this repo and follow the instructions below.

Install SparseML:

git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"

Replace the recipe as you like and run this one-shot compression script to apply SparseGPT:

import sparseml.transformers

original_model_name = "Xenova/llama2.c-stories110M"
calibration_dataset = "open_platypus"
output_directory = "output/"

# Raw string so the backslash in the regex below is not treated as a Python escape
recipe = r"""
test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      targets: ['re:model.layers.\d*$']
"""

# Apply SparseGPT to the model
sparseml.transformers.oneshot(
    model=original_model_name,
    dataset=calibration_dataset,
    recipe=recipe,
    output_dir=output_directory,
)
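
The `targets` entry in the recipe carries a `re:` prefix, which appears to mark the rest of the string as a regular expression over module names. The sketch below uses Python's `re` module to show which names the pattern `model.layers.\d*$` would select. The module names are illustrative, and the use of `re.match` is an assumption about how the pattern is applied, not a confirmed detail of SparseML.

```python
import re

# Pattern from the recipe, with the "re:" prefix removed.
pattern = re.compile(r"model.layers.\d*$")

# Hypothetical module names in the style of a Llama checkpoint.
module_names = [
    "model.embed_tokens",
    "model.layers.0",
    "model.layers.11",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]

# Keep names that match from the start; "$" anchors the end of the name.
selected = [name for name in module_names if pattern.match(name)]
print(selected)
```

Under this reading, the pattern selects each whole decoder block (`model.layers.<index>`) and excludes the embedding, the output head, and the individually named submodules inside a block.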

Slack

For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
