Update README.md (#4)
- Update README.md (51e41de05c6a35d74ded048cc36403d79c54f1b9)
- Update README.md (e4777d630dd7d7377fe1763d5e25b33cb54939d1)
- Update README.md (32958f278ddabc3cab38b1c061bf3c55c51921fc)
- Update README.md (b5422fe29cbd8d7c9996b93a8125aef19ffa2113)
- Update README.md (cfb884630c093e0df25ff873f445913d672af8eb)
Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>
README.md
CHANGED
@@ -116,7 +116,140 @@ The AutoAWQ script has been adapted from [AutoAWQ/examples/generate.py](https://
### 🤗 Text Generation Inference (TGI)
To run the `text-generation-launcher` with Llama 3.1 70B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you need Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package, which is required to log in to the Hugging Face Hub.

```bash
pip install -q --upgrade huggingface_hub
huggingface-cli login
```

Then run the TGI v2.2.0 (or higher) Docker container as follows:

```bash
docker run --gpus all --shm-size 1g -ti -p 8080:80 \
    -v hf_cache:/data \
    -e MODEL_ID=hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    -e NUM_SHARD=4 \
    -e QUANTIZE=awq \
    -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
    -e MAX_INPUT_LENGTH=4000 \
    -e MAX_TOTAL_TOKENS=4096 \
    ghcr.io/huggingface/text-generation-inference:2.2.0
```
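
The two length limits in the command above interact: in TGI, `MAX_TOTAL_TOKENS` bounds the prompt tokens plus the generated tokens, while `MAX_INPUT_LENGTH` bounds the prompt alone. The following is a minimal sketch of that arithmetic, not TGI code; the helper name `max_new_tokens_budget` is hypothetical, and `prompt_tokens` is assumed to be the token count after the chat template has been applied.

```python
# Values taken from the `docker run` command above.
MAX_INPUT_LENGTH = 4000
MAX_TOTAL_TOKENS = 4096


def max_new_tokens_budget(
    prompt_tokens: int,
    max_input_length: int = MAX_INPUT_LENGTH,
    max_total_tokens: int = MAX_TOTAL_TOKENS,
) -> int:
    """Return how many tokens can still be generated for a prompt of this size.

    Hypothetical helper for illustration only; it mirrors the constraint
    prompt_tokens + generated_tokens <= max_total_tokens.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, above the input limit of {max_input_length}"
        )
    return max_total_tokens - prompt_tokens
```

For example, `max_new_tokens_budget(4000)` returns `96`, so a prompt at the input limit leaves less room than the `max_tokens` value of 128 used in the requests in this section.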

> [!NOTE]
> TGI exposes several endpoints; to see all of them, check the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).

To send a request to the deployed TGI endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:

```bash
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tgi",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
```
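
The request above returns a JSON body that follows the OpenAI chat completion schema. As a rough sketch of how the reply could be pulled out of it, the snippet below parses such a body with the standard library. The `SAMPLE_RESPONSE` string is an illustrative stand-in that only mimics the schema, not real model output, and `extract_reply` is a hypothetical helper name.

```python
import json
from typing import Tuple

# Illustrative body shaped like an OpenAI-compatible chat completion.
# This is NOT real output from the model; the content is a placeholder.
SAMPLE_RESPONSE = """
{
  "object": "chat.completion",
  "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Deep learning is ..."},
      "finish_reason": "length"
    }
  ]
}
"""


def extract_reply(body: str) -> Tuple[str, str]:
    """Return (assistant text, finish reason) from a chat completion body."""
    data = json.loads(body)
    choice = data["choices"][0]  # a single completion is assumed
    return choice["message"]["content"], choice["finish_reason"]


reply, finish_reason = extract_reply(SAMPLE_RESPONSE)
```

A `finish_reason` of `length` indicates that generation stopped because it reached the `max_tokens` limit rather than ending naturally.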

Or programmatically via the `huggingface_hub` Python client as follows:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))

chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```

Alternatively, you can use the OpenAI Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```

### vLLM

To run vLLM with Llama 3.1 70B Instruct AWQ in INT4, you need Docker installed (see [installation notes](https://docs.docker.com/engine/install/)). Then run the latest vLLM Docker container as follows:

```bash
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
    -v hf_cache:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --max-model-len 4096
```

To send a request to the deployed vLLM endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:

```bash
curl 0.0.0.0:8000/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is Deep Learning?"
            }
        ],
        "max_tokens": 128
    }'
```
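
Where `curl` or a client library is not available, the same request can be assembled with the Python standard library alone. This is a minimal sketch under that assumption; `build_chat_request` is a hypothetical helper, and the URL and payload simply mirror the `curl` command above. The request object is only built here, not sent.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    max_tokens: int = 128,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    base_url="http://0.0.0.0:8000",
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    user_prompt="What is Deep Learning?",
)

# To send it, the vLLM container above must be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```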

Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:

```python
import os
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))

chat_completion = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
    max_tokens=128,
)
```
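
The TGI and vLLM examples in this section differ only in the port and in the `model` value they send: `tgi` for TGI, and the full model ID for vLLM. One way to keep that difference in a single place is a small settings table, sketched below. The `ENDPOINTS` dictionary and the `chat_kwargs` helper are hypothetical names introduced for illustration; the values are copied from the commands above.

```python
# Per-backend settings, copied from the OpenAI client examples in this section.
ENDPOINTS = {
    "tgi": {
        "base_url": "http://0.0.0.0:8080/v1",
        "model": "tgi",
    },
    "vllm": {
        "base_url": "http://0.0.0.0:8000/v1",
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    },
}


def chat_kwargs(backend: str, user_prompt: str, max_tokens: int = 128) -> dict:
    """Return keyword arguments for `client.chat.completions.create`."""
    settings = ENDPOINTS[backend]  # raises KeyError for an unknown backend
    return {
        "model": settings["model"],
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }
```

With the `openai` client, this could be used as `OpenAI(base_url=ENDPOINTS["vllm"]["base_url"], api_key="-").chat.completions.create(**chat_kwargs("vllm", "What is Deep Learning?"))`, provided the corresponding container is running.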

## Quantization Reproduction