Update README.md
README.md CHANGED

@@ -32,7 +32,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/latest/)

Run the following command to start the vLLM server:
```bash
-vllm serve
+vllm serve RedHatAI/Llama-3.1-8B-tldr
```

Once your server is started, you can query the model using the OpenAI API:

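The query snippet later in this diff assumes an OpenAI-compatible client already pointed at the local vLLM server. A minimal setup sketch; the base URL, port, and placeholder API key are assumptions based on vLLM's defaults and are not shown in the diff:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the host, port, and dummy
# API key below are assumptions based on vLLM defaults, not part of this diff.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
```
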
@@ -55,7 +55,7 @@ TITLE: Training sparse LLMs

POST: Now you can use the llm-compressor integration with axolotl to train sparse LLMs!

-It's super easy to use. See the example in https://huggingface.co/
+It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.

And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
"""

@@ -63,7 +63,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
-    model="
+    model="RedHatAI/Llama-3.1-8B-tldr",
    prompt=prompt,
    max_tokens=256,
)

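A short usage sketch for the call above: the generated TL;DR sits in the first choice of the standard OpenAI completions response (this builds on the `client` and `completion` objects from the snippet in the hunk):

```python
# Extract the generated summary from the completions response.
summary = completion.choices[0].text.strip()
print(summary)
```
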
@@ -202,7 +202,7 @@ The model was evaluated on the test split of [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr)
One can reproduce these results by using the following command:

```bash
-lm_eval --model vllm --model_args "pretrained=
+lm_eval --model vllm --model_args "pretrained=RedHatAI/Llama-3.1-8B-tldr,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
```

<table>

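The command above uses the vLLM backend of lm-evaluation-harness. As a hedged aside, recent releases ship that backend as an optional extra, so an install roughly like the following is typically needed first (the exact extra name can vary by version):

```bash
# Assumption: lm-evaluation-harness with its optional vLLM extra.
pip install "lm_eval[vllm]"
```
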
@@ -265,8 +265,8 @@ Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version

The figure below presents the **mean end-to-end latency per request** across varying request rates.
Results are shown for this model, as well as two variants:
-- **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/
-- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/
+- **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4)
+- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)



@@ -284,7 +284,7 @@ ds.to_json("tldr_1000.json")

2. Start a vLLM server using your target model:
```bash
-vllm serve
+vllm serve RedHatAI/Llama-3.1-8B-tldr
```

3. Run the benchmark with GuideLLM:

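The hunk header above references `tldr_1000.json`, the file created in step 1 of the README's benchmarking instructions. A minimal sketch of how such a file could be produced with the `datasets` library; taking the first 1,000 examples of the `trl-lib/tldr` test split is an assumption inferred from the filename and the evaluation section, not something shown in this diff:

```python
from datasets import load_dataset

# Assumption: 1,000 examples from the trl-lib/tldr test split, inferred from
# the filename "tldr_1000.json"; the README's actual step 1 may differ.
ds = load_dataset("trl-lib/tldr", split="test").select(range(1000))
ds.to_json("tldr_1000.json")
```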