Tested serving this model via vLLM using an Nvidia T4 (16GB VRAM).

Tested with the command below:

```
python -m vllm.entrypoints.openai.api_server --model astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit --max-model-len 8192 --dtype float16
```

For the non-stop token generation bug, make sure to send requests with `"stop_token_ids": [128001, 128009]` to the vLLM endpoint.

Example:
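A minimal sketch of such a request body in Python (the prompt, `max_tokens`, and endpoint URL in the comment are placeholders; only `stop_token_ids` and the model name come from this README):

```python
import json

# Illustrative payload for vLLM's OpenAI-compatible /v1/completions endpoint.
# stop_token_ids carries the Llama 3 end tokens (<|end_of_text|> = 128001,
# <|eot_id|> = 128009) so generation terminates correctly.
payload = {
    "model": "astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit",
    "prompt": "What is a GPTQ-quantized model?",  # placeholder prompt
    "max_tokens": 256,                            # placeholder limit
    "stop_token_ids": [128001, 128009],
}

# POST this JSON body to the running server (e.g. http://localhost:8000/v1/completions)
# with the header "Content-Type: application/json".
print(json.dumps(payload, indent=2))
```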