MiniChat-3B - DeepSparse

This repo contains model files for MiniChat-3B optimized for DeepSparse, a CPU inference runtime for sparse models.

This model was quantized and pruned with SparseGPT, using SparseML.

Inference

Install DeepSparse LLM for fast inference on CPUs:

pip install deepsparse-nightly[llm]

Run in a Python pipeline:

from deepsparse import TextGeneration

prompt = "How to get in a good university?"
formatted_prompt = f"<s> [|User|]\n{prompt}</s>[|Assistant|]\n"

model = TextGeneration(model_path="hf:nm-testing/MiniChat-3B-pruned50-quant-ds")

print(model(formatted_prompt, max_new_tokens=500).generations[0].text)
"""
To get into a good university, you should focus on your academic performance and strive to achieve high grades.
This can be done by setting realistic goals and targets, regularly reviewing your progress, and seeking help from teachers or tutors if needed.
Additionally, participating in extracurricular activities and building a network of friends can also help in getting into a good university.
"""

from deepsparse import TextGeneration

prompt = "How to become a great software engineer?"
formatted_prompt = f"<s> [|User|]\n{prompt}</s>[|Assistant|]\n"

model = TextGeneration(model="hf:nm-testing/MiniChat-3B-pruned50-quant-ds")

print(model(formatted_prompt, max_new_tokens=500).generations[0].text)
"""
To become a great software engineer, you need to have a strong foundation in computer science and programming. Here are some tips to help you become a great software engineer:
1. Learn a programming language: You need to learn at least one programming language that you can use to develop software applications. Some popular programming languages include Python, Java, and C++.
2. Learn about data structures and algorithms: You need to learn about data structures and algorithms that you can use to develop software applications. You can learn about data structures like arrays, linked lists, and trees, and algorithms like sorting algorithms and dynamic programming.
3. Practice your skills: You need to practice your skills in programming and data structures to become proficient in your chosen programming language. You can practice by working on open-source projects or contributing to open-source projects.
4. Keep up to date: You need to keep up to date with new technologies and programming languages to stay relevant in the field. You can keep up to date by reading blogs, attending meetups, and participating in online communities.
5. Collaborate with others: You can collaborate with others to develop software applications that can benefit society. You can collaborate with others by participating in open-source projects, contributing to open-source communities, and sharing knowledge with others.
By following these tips, you can become a great software engineer and develop software applications that can benefit society.
"""

Prompt template


  <s> [|User|]\n
  {prompt}
  </s>[|Assistant|]\n
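
The template can be applied with a small helper. This is a minimal sketch: `format_prompt` is a hypothetical name, not part of DeepSparse, and it assumes a single-turn conversation like the Inference examples in this card.

```python
def format_prompt(prompt: str) -> str:
    """Wrap a single user message in the MiniChat prompt template."""
    # Same f-string as in the Inference examples, factored into a function.
    return f"<s> [|User|]\n{prompt}</s>[|Assistant|]\n"


# repr() makes the newline characters in the template visible.
print(repr(format_prompt("How to become a great software engineer?")))
```

The returned string can be passed to the pipeline in place of `formatted_prompt`.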

Sparsification

For details on how this model was sparsified, see the recipe.yaml in this repo. To reproduce the process, follow the steps below.

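# Install SparseML from source with the transformers extras
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
# Apply the one-shot recipe to GeneZC/MiniChat-3B, using open_platypus as the calibration dataset
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py GeneZC/MiniChat-3B open_platypus --recipe recipe.yaml --save True
# Export the sparsified model to ONNX
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
# Keep a copy of the exported model as the input for kv-cache injection
cp deployment/model.onnx deployment/model-orig.onnx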

Run the following kv-cache injection to speed up inference by caching the Key and Value states:

import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"

# Load only the graph; external weight data is not read into memory
model = onnx.load(input_file, load_external_data=False)
# Rewrite the graph so it takes and returns cached Key and Value states
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
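
To see why caching Key and Value states helps, here is a toy sketch in pure Python. It is not how DeepSparse or `KeyValueCacheInjector` is implemented: `project`, `attend`, and both decode functions are hypothetical stand-ins, and the "projection" is a made-up arithmetic rule rather than a learned weight matrix. It only shows that reusing cached keys and values gives the same attention output while doing fewer projections.

```python
import math


def project(token, offset):
    """Hypothetical stand-in for a learned key or value projection."""
    return [x * (i + 1) + offset for i, x in enumerate(token)]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(t, 0.0) for t in prefix]
        values = [project(t, 1.0) for t in prefix]
        projections += 2 * step
        outputs.append(attend(tokens[step - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    keys, values, outputs, projections = [], [], [], 0
    for token in tokens:
        keys.append(project(token, 0.0))
        values.append(project(token, 1.0))
        projections += 2
        outputs.append(attend(token, keys, values))
    return outputs, projections


TOKENS = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
plain_out, plain_cost = decode_without_cache(TOKENS)
cached_out, cached_cost = decode_with_cache(TOKENS)
print(plain_out == cached_out)  # True: identical attention outputs
print(plain_cost, cached_cost)  # 12 6
```

Without a cache the number of projections grows quadratically with sequence length; with a cache it grows linearly.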

For a step-by-step guide to one-shot quantization of large language models, see our One Shot With SparseML page.

Slack

For further support and discussion of these models and AI in general, join Neural Magic's Slack Community.
