File size: 2,958 Bytes
2e024c3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
base_model: microsoft/Orca-2-13b
inference: false
model_type: llama
prompt_template: |
  <|im_start|>system
  {system_message}<|im_end|>
  <|im_start|>user
  {prompt}<|im_end|>
  <|im_start|>assistant
quantized_by: mgoin
tags:
- deepsparse
---

# Orca 2 7B - DeepSparse

This repo contains model files for [Microsoft's Orca 2 7B](https://huggingface.co/microsoft/Orca-2-13b) optimized for [DeepSparse](https://github.com/neuralmagic/deepsparse), a CPU inference runtime for sparse models.

This model was quantized and pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference

Install [DeepSparse LLM](https://github.com/neuralmagic/deepsparse) for fast inference on CPUs: 
```
pip install deepsparse-nightly[llm]
```

Run in a [Python pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md):
```python
from deepsparse import TextGeneration
system_message = ""
prompt = "Who inspires you the most?"
formatted_prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"
model = TextGeneration(model="hf:mgoin/Orca-2-13b-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=100).generations[0].text)
"""
That's a difficult question as there are many people who inspire me. However, one person who inspires me the most is my mother. She has shown me the importance of hard work, resilience, and perseverance. She has shown me how to overcome obstacles and how to be a strong and independent woman.
"""
```

## Prompt template: ChatML

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

```

## Sparsification

For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.

```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py microsoft/Orca-2-13b open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
cp deployment/model.onnx deployment/model-orig.onnx
```

Run this kv-cache injection afterwards:
```python
import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector
input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"
model = onnx.load(input_file, load_external_data=False)
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
```

## Slack

For further support, and discussions on these models and AI in general, join us at [Neural Magic's Slack server](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)