alvarobartt (HF staff) committed
Commit 6e62725 • 1 Parent(s): 7e1dd3f

Update README.md

Files changed (1)
  1. README.md +20 -15
README.md CHANGED
@@ -35,7 +35,8 @@ In order to use the current quantized model, support is offered for different so
  In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

  ```bash
- pip install -q --upgrade transformers autoawq accelerate
+ pip install -q --upgrade "transformers>=4.45.0" accelerate
+ INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
  ```

  To run the inference on top of Gemma2 9B Instruct AWQ in INT4 precision, the AWQ model can be instantiated as any other causal language modeling model via `AutoModelForCausalLM` and run the inference normally.
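
For reference, the `AutoModelForCausalLM` snippet that the context line above alludes to lives outside this hunk; a minimal, hedged sketch of it (the prompt and generation parameters are illustrative placeholders, not values taken from the README):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# AWQ runs the INT4 weights against fp16 activations, so the model is loaded in float16
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

messages = [{"role": "user", "content": "What's Deep Learning?"}]  # placeholder prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
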
@@ -81,7 +82,8 @@ print(tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_spe
  In order to run the inference with Gemma2 9B Instruct AWQ in INT4, you need to install the following packages:

  ```bash
- pip install -q --upgrade transformers autoawq accelerate
+ pip install -q --upgrade "transformers>=4.45.0" accelerate
+ INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
  ```

  Alternatively, one may want to run that via `AutoAWQ` even though it's built on top of 🤗 `transformers`, which is the recommended approach instead as described above.
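
The `AutoAWQ` path mentioned in that context line is covered later in the README (adapted from `AutoAWQ/examples/generate.py`); a rough sketch only, assuming the same model id and a placeholder prompt:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "hugging-quants/gemma-2-9b-it-AWQ-INT4"

# Load the pre-quantized AWQ checkpoint; fuse_layers enables the faster fused kernels
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "What's Deep Learning?"}]  # placeholder prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# AutoAWQ forwards `generate` to the underlying transformers model
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
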
@@ -119,24 +121,18 @@ The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https:

  ### 🤗 Text Generation Inference (TGI)

- To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package as you need to login to the Hugging Face Hub.
-
- ```bash
- pip install -q --upgrade huggingface_hub
- huggingface-cli login
- ```
+ To run the `text-generation-launcher` with Gemma2 9B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)).

- Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows:
+ Then you just need to run the TGI v2.3.0 (or higher) Docker container as follows:

  ```bash
  docker run --gpus all --shm-size 1g -ti -p 8080:80 \
  -v hf_cache:/data \
  -e MODEL_ID=hugging-quants/gemma-2-9b-it-AWQ-INT4 \
  -e QUANTIZE=awq \
- -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  -e MAX_INPUT_LENGTH=4000 \
  -e MAX_TOTAL_TOKENS=4096 \
- ghcr.io/huggingface/text-generation-inference:2.2.0
+ ghcr.io/huggingface/text-generation-inference:2.3.0
  ```

  > [!NOTE]
@@ -166,7 +162,7 @@ Or programatically via the `huggingface_hub` Python client as follows:
  import os
  from huggingface_hub import InferenceClient

- client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
+ client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")

  chat_completion = client.chat.completions.create(
  model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
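
The `chat.completions.create(...)` call is cut off by the hunk above; a self-contained sketch of the whole request against the local TGI endpoint (the prompt and `max_tokens` below are illustrative placeholders):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080", api_key="-")

chat_completion = client.chat.completions.create(
    model="hugging-quants/gemma-2-9b-it-AWQ-INT4",
    messages=[
        {"role": "user", "content": "What's Deep Learning?"},  # placeholder prompt
    ],
    max_tokens=128,
)
print(chat_completion.choices[0].message.content)
```
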
@@ -183,7 +179,7 @@ Alternatively, the OpenAI Python client can also be used (see [installation note
  import os
  from openai import OpenAI

- client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))
+ client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="-")

  chat_completion = client.chat.completions.create(
  model="tgi",
@@ -243,16 +239,25 @@ chat_completion = client.chat.completions.create(

  ## Quantization Reproduction

- > [!NOTE]
+ > [!IMPORTANT]
  > In order to quantize Gemma2 9B Instruct using AutoAWQ, you will need to use an instance with at least enough CPU RAM to fit the whole model i.e. ~20GiB, and an NVIDIA GPU with 16GiB of VRAM to quantize it.
+ >
+ > Additionally, you also need to accept the Gemma2 access conditions, as it is a gated model that requires accepting those first.

  In order to quantize Gemma2 9B Instruct, first install the following packages:

  ```bash
- pip install -q --upgrade "torch==2.3.0" transformers accelerate
+ pip install -q --upgrade "torch==2.3.0" "transformers>=4.45.0" accelerate
  INSTALL_KERNELS=1 pip install -q git+https://github.com/casper-hansen/AutoAWQ.git@79547665bdb27768a9b392ef375776b020acbf0c
  ```

+ Then you need to install the `huggingface_hub` Python SDK and login to the Hugging Face Hub.
+
+ ```bash
+ pip install -q --upgrade huggingface_hub
+ huggingface-cli login
+ ```
+
  Then run the following script, adapted from [`AutoAWQ/examples/quantize.py`](https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py):

  ```python
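
The quantization script itself is truncated by the diff; a hedged sketch of the adapted `quantize.py` flow (the `quant_config` values and paths below are assumptions, not read from the README):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "google/gemma-2-9b-it"  # gated model: requires accepted access conditions and a Hub login
quant_path = "gemma-2-9b-it-AWQ-INT4"
# Typical AWQ INT4 settings; assumed here, not taken from the README
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model into CPU RAM (~20GiB) together with its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and INT4 quantization (needs an NVIDIA GPU with ~16GiB of VRAM)
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
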
 