alexmarques committed · Commit 708acde · verified · Parent: b50158e

Update README.md

Files changed (1):
  1. README.md +7 -7

README.md CHANGED
@@ -32,7 +32,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/late
  
  Run the following command to start the vLLM server:
  ```bash
- vllm serve nm-testing/Llama-3.1-8B-tldr
+ vllm serve RedHatAI/Llama-3.1-8B-tldr
  ```
  
  Once your server is started, you can query the model using the OpenAI API:
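Before sending requests, a quick way to confirm the server is up (a minimal sketch, assuming vLLM's default port 8000; adjust if the server was started with `--port`):

```bash
# List the models exposed by the local OpenAI-compatible endpoint;
# the served model name should appear in the response.
curl http://localhost:8000/v1/models
```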
@@ -55,7 +55,7 @@ TITLE: Training sparse LLMs
  
  POST: Now you can use the llm-compressor integration to axolotl to train sparse LLMs!
  
- It's super easy to use. See the example in https://huggingface.co/nm-testing/Sparse-Llama-3.1-8B-tldr-2of4.
+ It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.
  
  And there's more. You can run 2:4 sparse models on vLLM and get significant speedupts on Hopper GPUs!
  """
@@ -63,7 +63,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant spee
  prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"
  
  completion = client.completions.create(
- model="nm-testing/Llama-3.1-8B-tldr",
+ model="RedHatAI/Llama-3.1-8B-tldr",
  prompt=prompt,
  max_tokens=256,
  )
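For context, a minimal end-to-end client sketch consistent with the snippet in this hunk. The base URL and the placeholder API key are assumptions (vLLM's OpenAI-compatible server defaults to port 8000 and does not require a real key unless one is configured):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Shortened version of the example Reddit post shown in the hunks above.
post = """TITLE: Training sparse LLMs

POST: Now you can use the llm-compressor integration to axolotl to train sparse LLMs!
"""

prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
    model="RedHatAI/Llama-3.1-8B-tldr",
    prompt=prompt,
    max_tokens=256,
)
print(completion.choices[0].text)
```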
@@ -202,7 +202,7 @@ The model was evaluated on the test split of [trl-lib/tldr](https://huggingface.
  One can reproduce these results by using the following command:
  
  ```bash
- lm_eval --model vllm --model_args "pretrained=nm-testing/Llama-3.1-8B-tldr,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
+ lm_eval --model vllm --model_args "pretrained=RedHatAI/Llama-3.1-8B-tldr,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
  ```
  
  <table>
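The evaluation command above assumes lm-evaluation-harness is installed with its vLLM backend and that the custom `tldr` task is available to the harness; a minimal setup sketch (the `vllm` extra name is an assumption to check against the lm-eval documentation):

```bash
# Install lm-evaluation-harness with vLLM support (extra name assumed).
pip install "lm_eval[vllm]"
```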
@@ -265,8 +265,8 @@ Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version
  
  The figure below presents the **mean end-to-end latency per request** across varying request rates.
  Results are shown for this model, as well as two variants:
- - **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/nm-testing/Sparse-Llama-3.1-8B-tldr-2of4)
- - **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/nm-testing/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)
+ - **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4)
+ - **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)
  
  ![Latency](./inference_performance/latency.png)
  
@@ -284,7 +284,7 @@ ds.to_json("tldr_1000.json")
  
  2. Start a vLLM server using your target model:
  ```bash
- vllm serve nm-testing/Llama-3.1-8B-tldr
+ vllm serve RedHatAI/Llama-3.1-8B-tldr
  ```
  
  3. Run the benchmark with GuideLLM:
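Step 1 (producing `tldr_1000.json`, referenced in the hunk header) is not shown in this diff; a sketch of what it presumably looks like, with the dataset id taken from the evaluation section above and the 1,000-prompt subset size inferred from the filename:

```python
from datasets import load_dataset

# Build the benchmark prompt file used by GuideLLM in step 3.
# The dataset id and the 1,000-example subset are assumptions inferred
# from the surrounding README (trl-lib/tldr test split, "tldr_1000.json").
ds = load_dataset("trl-lib/tldr", split="test").select(range(1000))
ds.to_json("tldr_1000.json")
```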
 