Commit 6e62725
Parent(s): 7e1dd3f
Update README.md
README.md CHANGED
@@ -35,7 +35,8 @@ In order to use the current quantized model, support is offered for different so
In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
```

To run the inference on top of Gemma2 9B Instruct AWQ in INT4 precision, the AWQ model can be instantiated via `AutoModelForCausalLM` like any other causal language model, and inference can be run as usual.
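For reference, a minimal sketch of that `AutoModelForCausalLM` path (the prompt and generation settings are illustrative, and a CUDA GPU is assumed):

```python
# Minimal illustrative sketch: load the AWQ INT4 checkpoint like any causal LM and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ kernels run with half-precision activations
    device_map="auto",
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is AWQ quantization?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0, inputs.shape[1]:], skip_special_tokens=True))
```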
@@ -81,7 +82,8 @@ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_spe
In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

```bash
-pip install -q --upgrade transformers
+pip install -q --upgrade "transformers>=4.45.0" accelerate
+INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
```

Alternatively, one may want to run the inference via `AutoAWQ`, even though it is built on top of 🤗 `transformers`, which is the recommended approach as described above.
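Similarly, a minimal sketch of the `AutoAWQ` path, loosely following `AutoAWQ/examples/generate.py` (the prompt, `fuse_layers`, and generation settings are assumptions):

```python
# Illustrative sketch of the AutoAWQ loading path for the same checkpoint.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)

# Build a chat-formatted prompt and generate as usual.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is AWQ quantization?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens=True))
```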
@@ -119,24 +121,18 @@ The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https:

### 🤗 Text Generation Inference (TGI)

-To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/))
-
-```bash
-pip install -q --upgrade huggingface_hub
-huggingface-cli login
-```
+To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)).

-Then you just need to run the TGI v2.
+Then you just need to run the TGI v2.3.0 (or higher) Docker container as follows:

```bash
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
  -v hf_cache:/data \
  -e MODEL_ID=hugging-quants/gemma-2-9b-it-AWQ-INT4 \
  -e QUANTIZE=awq \
-  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
-  ghcr.io/huggingface/text-generation-inference:2.
+  ghcr.io/huggingface/text-generation-inference:2.3.0
```

> [!NOTE]
@@ -166,7 +162,7 @@ Or programmatically via the `huggingface_hub` Python client as follows:
import os
from huggingface_hub import InferenceClient

-client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=
+client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")

chat_completion = client.chat.completions.create(
    model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
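For reference, a complete call in the shape the excerpt above suggests could look as follows (the message content and `max_tokens` are illustrative):

```python
# Illustrative, complete version of the huggingface_hub client call sketched above.
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")

chat_completion = client.chat.completions.create(
    model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
    messages=[
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
print(chat_completion.choices[0].message.content)
```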
@@ -183,7 +179,7 @@ Alternatively, the OpenAI Python client can also be used (see [installation note
import os
from openai import OpenAI

-client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=
+client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="-")

chat_completion = client.chat.completions.create(
    model="tgi",
@@ -243,16 +239,25 @@ chat_completion = client.chat.completions.create(

## Quantization Reproduction

-> [!
+> [!IMPORTANT]
> In order to quantize Gemma2 9B Instruct using AutoAWQ, you will need to use an instance with at least enough CPU RAM to fit the whole model, i.e. ~20GiB, and an NVIDIA GPU with 16GiB of VRAM to quantize it.
+>
+> Additionally, you need to accept the Gemma2 access conditions, as it is a gated model.

In order to quantize Gemma2 9B Instruct, first install the following packages:

```bash
-pip install -q --upgrade "torch==2.3.0" transformers accelerate
+pip install -q --upgrade "torch==2.3.0" "transformers>=4.45.0" accelerate
INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
```

+Then you need to install the `huggingface_hub` Python SDK and log in to the Hugging Face Hub.
+
+```bash
+pip install -q --upgrade huggingface_hub
+huggingface-cli login
+```
+
Then run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py):

```python
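# Illustrative sketch along the lines of AutoAWQ/examples/quantize.py, not the
# repo's exact script; the paths and quantization config below are assumptions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "google/gemma-2-9b-it"    # gated upstream model (access must be accepted)
quant_path = "gemma-2-9b-it-AWQ-INT4"  # local output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model into CPU RAM (~20GiB) along with its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantization on the GPU, then save the INT4 checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```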