Shu committed on
Commit f646757
1 Parent(s): 04a4be4

Update README.md

Files changed (1)
  1. README.md +66 -24
README.md CHANGED
@@ -25,54 +25,97 @@ This repo includes two types of quantized models: **GGUF** and **AWQ**, for our

# GGUF Quantization
- Run with [Ollama](https://github.com/ollama/ollama)

```bash
- ollama run NexaAIDev/octopus-v2-Q4_K_M
```

# AWQ Quantization
Python example:

```python
from awq import AutoAWQForCausalLM
- from transformers import AutoTokenizer, GemmaForCausalLM
import torch
import time
import numpy as np

def inference(input_text):
-
-     tokens = tokenizer(
-         input_text,
-         return_tensors='pt'
-     ).input_ids.cuda()
-
    start_time = time.time()
    generation_output = model.generate(
-         tokens,
-         do_sample=True,
-         temperature=0.7,
-         top_p=0.95,
-         top_k=40,
-         max_new_tokens=512
    )
    end_time = time.time()

-     res = tokenizer.decode(generation_output[0])
-     res = res.split(input_text)
    latency = end_time - start_time
-     output_tokens = tokenizer.encode(res)
-     num_output_tokens = len(output_tokens)
    throughput = num_output_tokens / latency

-     return {"output": res[-1], "latency": latency, "throughput": throughput}

-
- model_id = "path/to/Octopus-v2-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
    trust_remote_code=False, safetensors=True)
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]
@@ -115,4 +158,3 @@ _Quantized with llama.cpp_

**Acknowledgement**:
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.
-

# GGUF Quantization
+ ## (Recommended) Run with [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ 1. **Clone and compile:**
+
+ ```bash
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ # Compile the source code:
+ make
+ ```
+
+ 2. **Prepare the Input Prompt File:**
+
+ Navigate to the `prompts` folder inside the `llama.cpp` directory and create a new file named `chat-with-octopus.txt`.
+
+ `chat-with-octopus.txt`:
+
+ ```bash
+ User:
+ ```
+
+ 3. **Execute the Model:**
+
+ Run the following command in the terminal:
+
+ ```bash
+ ./main -m ./path/to/octopus-v2-Q4_K_M.gguf -c 512 -b 2048 -n 256 -t 1 --repeat_penalty 1.0 --top_k 0 --top_p 1.0 --color -i -r "User:" -f prompts/chat-with-octopus.txt
+ ```
+
+ Here `-c` sets the context size, `-b` the batch size, `-n` the number of tokens to generate, `-t` the thread count, `-i` starts an interactive session, and `-r "User:"` hands control back to you whenever the model emits the reverse prompt "User:".
+
+ Example prompt to interact with the model:
+ ```bash
+ <|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>
+ ```
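
If you'd rather script a single non-interactive run, `main` also accepts the prompt inline via `-p`; a minimal sketch reusing the model path and prompt format shown above:

```bash
./main -m ./path/to/octopus-v2-Q4_K_M.gguf -n 256 --color \
  -p "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
```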
+
+ ## Run with [Ollama](https://github.com/ollama/ollama)
+ 1. Create a `Modelfile` in your directory and include a `FROM` statement with the path to your local model (a scripted variant of steps 1–2 follows this list):
```bash
+ FROM ./path/to/octopus-v2-Q4_K_M.gguf
+ ```
+
+ 2. Use the following command to add the model to Ollama:
+ ```bash
+ ollama create octopus-v2-Q4_K_M -f Modelfile
+ ```
+
+ 3. Verify that the model has been successfully imported:
+ ```bash
+ ollama ls
+ ```
+
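
The steps above can also be scripted; a minimal sketch using the same file and model names:

```bash
# Write the Modelfile, then register the model with Ollama
cat > Modelfile <<'EOF'
FROM ./path/to/octopus-v2-Q4_K_M.gguf
EOF
ollama create octopus-v2-Q4_K_M -f Modelfile
```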
+ ### Run the model
+ ```bash
+ ollama run octopus-v2-Q4_K_M "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
```
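
Once the Ollama server is running, the model can also be queried over its local REST API (port 11434 by default); a minimal sketch of the same request over HTTP, assuming the model was registered as `octopus-v2-Q4_K_M` above:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "octopus-v2-Q4_K_M",
  "prompt": "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>",
  "stream": false
}'
```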

# AWQ Quantization
Python example:

```python
+ from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np

def inference(input_text):
    start_time = time.time()
+     input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
+     input_length = input_ids["input_ids"].shape[1]
    generation_output = model.generate(
+         input_ids["input_ids"],
+         do_sample=False,
+         max_length=1024
    )
    end_time = time.time()

+     # Decode only the generated part, skipping the echoed input tokens
+     generated_sequence = generation_output[:, input_length:].tolist()
+     res = tokenizer.decode(generated_sequence[0])
+
    latency = end_time - start_time
+     num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency

+     return {"output": res, "latency": latency, "throughput": throughput}

+ # Initialize tokenizer and model
+ model_id = "path/to/Octopus-v2-AWQ"  # replace with your local model directory
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
    trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]
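
# Example usage (illustrative sketch; the remainder of the original script
# is elided from this hunk):
for prompt in prompts:
    result = inference(prompt)
    print(result["output"])
    print(f"latency: {result['latency']:.2f} s, throughput: {result['throughput']:.2f} tokens/s")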