---
base_model: HuggingFaceH4/zephyr-7b-beta
inference: true
model_type: mistral
quantized_by: robertgshaw2
tags:
- nm-vllm
- marlin
- int4
---

## zephyr-7b-beta-marlin
This repo contains model files for [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) optimized for [nm-vllm](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was quantized with [GPTQ](https://arxiv.org/abs/2210.17323) and saved in the Marlin format for efficient 4-bit inference. Marlin is a highly optimized inference kernel for 4-bit models.
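
The `quantization_config` block in the checkpoint's `config.json` records how the model was quantized. As an optional, hedged sketch (it assumes the `huggingface_hub` package, uses the repo id from the inference example below, and the exact field names depend on the tooling that produced the checkpoint), you can inspect it like this:

```python
import json

from huggingface_hub import hf_hub_download  # assumption: pip install huggingface_hub

# Repo id taken from the inference example below.
config_path = hf_hub_download("neuralmagic/zephyr-7b-beta-marlin", "config.json")
with open(config_path) as f:
    config = json.load(f)

# GPTQ/Marlin quantization metadata (bits, group size, checkpoint format, ...);
# the exact fields vary with the tooling version that produced the checkpoint.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```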

## Inference
Install [nm-vllm](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory usage:
```bash
pip install nm-vllm[sparse]
```

Run in a Python pipeline for local inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/zephyr-7b-beta-marlin"
model = LLM(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "What is quantization in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
sampling_params = SamplingParams(max_tokens=200)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

"""
Sure! Here's a simple recipe for banana bread:

Ingredients:
- 3-4 ripe bananas,mashed
- 1 large egg
- 2 Tbsp. Flour
- 2 tsp. Baking powder
- 1 tsp. Baking soda
- 1/2 tsp. Ground cinnamon
- 1/4 tsp. Salt
- 1/2 cup butter, melted
- 3 Cups All-purpose flour
- 1/2 tsp. Ground cinnamon

Instructions:

1. Preheat your oven to 350 F (175 C).
"""
```
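
Beyond the offline `LLM` pipeline, nm-vllm follows upstream vLLM and can expose an OpenAI-compatible HTTP server (started with `python -m vllm.entrypoints.openai.api_server --model neuralmagic/zephyr-7b-beta-marlin`). The sketch below queries such a server with the `openai` Python client; treat it as a hedged example, since the exact server flags and default port can vary by version:

```python
# Assumes an nm-vllm OpenAI-compatible server is already running locally on port 8000,
# e.g. via: python -m vllm.entrypoints.openai.api_server --model neuralmagic/zephyr-7b-beta-marlin
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused unless the server requires one

response = client.chat.completions.create(
    model="neuralmagic/zephyr-7b-beta-marlin",
    messages=[{"role": "user", "content": "What is quantization in machine learning?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```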

## Quantization
For details on how this model was quantized and converted to the Marlin format, see the `quantization/apply_gptq_save_marlin.py` script, which can be run as follows:

```bash
pip install -r quantization/requirements.txt
python3 quantization/apply_gptq_save_marlin.py --model-id HuggingFaceH4/zephyr-7b-beta --save-dir ./zephyr-marlin
```
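
The script above is the authoritative reference. As a rough, hedged sketch of the underlying GPTQ step only (it assumes the AutoGPTQ library, which this card does not confirm, and it omits the repacking of the quantized weights into the Marlin kernel layout):

```python
# Hedged sketch only: assumes the AutoGPTQ library; the repo's actual script may differ,
# and the final conversion of the GPTQ weights into the Marlin layout is not shown here.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ settings commonly paired with Marlin kernels (group size 128, no act-order).
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A real run uses a proper calibration dataset; a single sentence is just a placeholder.
calibration = [tokenizer("nm-vllm is a high-throughput serving engine for compressed LLMs.")]
model.quantize(calibration)

model.save_quantized("./zephyr-gptq")  # GPTQ checkpoint; Marlin conversion happens afterwards
```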

## Slack

For further support, and for discussion of these models and AI in general, join [Neural Magic's Slack Community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).