hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4

reach-vb HF staff

alvarobartt HF staff committed on Jul 24

Commit 04c4b2d • 1 Parent(s): 51de55c

Update README.md (#5)

- Update README.md (57492d6f08fe62476c6c2cb0469d2e208e87054e)
- Update README.md (59af7ca3372c1acd5330091322bbeca6fd4bd6ce)

Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>

Files changed (1)

README.md +136 -1

@@ -122,7 +122,142 @@ The AutoGPTQ script has been adapted from [`AutoGPTQ/examples/quantization/basic
 ### 🤗 Text Generation Inference (TGI)
-Coming soon!
+To run `text-generation-launcher` with Llama 3.1 405B Instruct GPTQ in INT4, using Marlin kernels for optimized inference speed, you need Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package, which is required to log in to the Hugging Face Hub.
+```bash
+pip install -q --upgrade huggingface_hub
+huggingface-cli login
+```
+Then you just need to run the TGI v2.2.0 (or higher) Docker container as follows:
+```bash
+docker run --gpus all --shm-size 1g -ti -p 8080:80 \
+  -v hf_cache:/data \
+  -e MODEL_ID=hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
+  -e NUM_SHARD=8 \
+  -e QUANTIZE=gptq \
+  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+  -e MAX_INPUT_LENGTH=4000 \
+  -e MAX_TOTAL_TOKENS=4096 \
+  ghcr.io/huggingface/text-generation-inference:2.2.0
+```
+> [!NOTE]
+> TGI exposes several endpoints; to see all of them, check the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).
+To send a request to the deployed TGI endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
+```bash
+curl 0.0.0.0:8080/v1/chat/completions \
+  -X POST \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "tgi",
+    "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful assistant."
+      },
+      {
+        "role": "user",
+        "content": "What is Deep Learning?"
+      }
+    ],
+    "max_tokens": 128
+  }'
+```
+Or programmatically via the `huggingface_hub` Python client as follows:
+```python
+import os
+from huggingface_hub import InferenceClient
+client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
+chat_completion = client.chat.completions.create(
+  model="hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
+  messages=[
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is Deep Learning?"},
+  ],
+  max_tokens=128,
+)
+```
+Alternatively, the OpenAI Python client can also be used (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+```python
+import os
+from openai import OpenAI
+client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))
+chat_completion = client.chat.completions.create(
+  model="tgi",
+  messages=[
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is Deep Learning?"},
+  ],
+  max_tokens=128,
+)
+```
+### vLLM
+To run vLLM with Llama 3.1 405B Instruct GPTQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:
+```bash
+docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
+  -v hf_cache:/root/.cache/huggingface \
+  vllm/vllm-openai:latest \
+  --model hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 \
+  --quantization gptq_marlin \
+  --tensor-parallel-size 8 \
+  --max-model-len 4096
+```
+To send a request to the deployed vLLM endpoint compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
+```bash
+curl 0.0.0.0:8000/v1/chat/completions \
+  -X POST \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
+    "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful assistant."
+      },
+      {
+        "role": "user",
+        "content": "What is Deep Learning?"
+      }
+    ],
+    "max_tokens": 128
+  }'
+```
+Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+```python
+import os
+from openai import OpenAI
+client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
+chat_completion = client.chat.completions.create(
+  model="hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
+  messages=[
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is Deep Learning?"},
+  ],
+  max_tokens=128,
+)
+```
 ## Quantization Reproduction
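
The Python snippets added in this hunk create a `chat_completion` object but stop before reading the generated text. As a minimal sketch of the full round trip, assuming the server returns an OpenAI-style body with `choices[0].message.content` (the shape implied by the `/v1/chat/completions` endpoints above), the same request can be built and parsed with only the standard library. The helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical and are not part of the README.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_prompt, max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion dict.

    Assumption: the body follows the OpenAI chat completion layout.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, user_prompt):
    """Send the request to a running server and return the assistant text."""
    request = build_chat_request(base_url, model, user_prompt)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the TGI container from the hunk running, `chat("http://0.0.0.0:8080", "tgi", "What is Deep Learning?")` would target it; for the vLLM container, the base URL would be `http://0.0.0.0:8000` and the model would be the full repository ID. When using the `openai` or `huggingface_hub` clients instead, the equivalent read is `chat_completion.choices[0].message.content`.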