Update README.md
README.md
````diff
@@ -94,8 +94,9 @@ print(tokenizer.decode(response, skip_special_tokens=True))
 ## Inference Server Hosting Example
 ```bash
 pip install vllm
-vllm serve scb10x/llama3.1-typhoon2-70b-instruct --tensor-parallel-size 2
-# using at least 2 80GB gpu for hosting 70b model
+vllm serve scb10x/llama3.1-typhoon2-70b-instruct --tensor-parallel-size 2 --gpu-memory-utilization 0.95 --max-model-len 16384
+# use at least two 80GB GPUs (e.g. A100, H100) to host the 70B model
+# to serve longer contexts, 4-8 GPUs are required
 # see more information at https://docs.vllm.ai/
 ```
 
````
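Once `vllm serve` is running, it exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). Below is a minimal sketch of querying the hosted model with the `openai` Python client; the default host/port and the placeholder prompt are assumptions, so adjust them to your deployment:

```python
# Query the vLLM OpenAI-compatible server started above.
# Assumes the default endpoint (localhost:8000); change base_url if you
# passed --host/--port to vllm serve.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

response = client.chat.completions.create(
    model="scb10x/llama3.1-typhoon2-70b-instruct",  # must match the served checkpoint
    messages=[{"role": "user", "content": "Hello!"}],  # placeholder prompt
    max_tokens=256,
)
print(response.choices[0].message.content)
```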