Update README.md
README.md CHANGED

@@ -32,7 +32,7 @@ This model can be deployed efficiently using [vLLM](https://docs.vllm.ai/en/latest/)

Run the following command to start the vLLM server:
```bash
-vllm serve
+vllm serve RedHatAI/Llama-3.1-8B-tldr
```

Once your server is started, you can query the model using the OpenAI API:

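The query snippet later in this diff assumes an OpenAI-compatible client already pointed at the local vLLM server. A minimal setup sketch; the base URL, port, and placeholder API key are assumptions based on vLLM's defaults and are not shown in the diff:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the host, port, and dummy
# API key below are assumptions based on vLLM defaults, not part of this diff.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
```
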
@@ -55,7 +55,7 @@ TITLE: Training sparse LLMs

POST: Now you can use the llm-compressor integration with axolotl to train sparse LLMs!

-It's super easy to use. See the example in https://huggingface.co/
+It's super easy to use. See the example in https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4.

And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
"""

@@ -63,7 +63,7 @@ And there's more. You can run 2:4 sparse models on vLLM and get significant speedups on Hopper GPUs!
prompt = f"Give a TL;DR of the following Reddit post.\n<|user|>{post}\nTL;DR:\n<|assistant|>\n"

completion = client.completions.create(
-    model="
+    model="RedHatAI/Llama-3.1-8B-tldr",
    prompt=prompt,
    max_tokens=256,
)

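A short usage sketch for the call above: the generated TL;DR sits in the first choice of the standard OpenAI completions response (this builds on the `client` and `completion` objects from the snippet in the hunk):

```python
# Extract the generated summary from the completions response.
summary = completion.choices[0].text.strip()
print(summary)
```
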
@@ -202,7 +202,7 @@ The model was evaluated on the test split of [trl-lib/tldr](https://huggingface.co/datasets/trl-lib/tldr)
One can reproduce these results by using the following command:

```bash
-lm_eval --model vllm --model_args "pretrained=
+lm_eval --model vllm --model_args "pretrained=RedHatAI/Llama-3.1-8B-tldr,dtype=auto,add_bos_token" --batch-size auto --tasks tldr
```

<table>

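The command above uses the vLLM backend of lm-evaluation-harness. As a hedged aside, recent releases ship that backend as an optional extra, so an install roughly like the following is typically needed first (the exact extra name can vary by version):

```bash
# Assumption: lm-evaluation-harness with its optional vLLM extra.
pip install "lm_eval[vllm]"
```
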
@@ -265,8 +265,8 @@ Benchmarking was conducted with [vLLM](https://docs.vllm.ai/en/latest/) version

The figure below presents the **mean end-to-end latency per request** across varying request rates.
Results are shown for this model, as well as two variants:
-- **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/
-- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/
+- **Sparse:** [Sparse-Llama-3.1-8B-tldr-2of4](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4)
+- **Sparse-quantized:** [Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic](https://huggingface.co/RedHatAI/Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic)



@@ -284,7 +284,7 @@ ds.to_json("tldr_1000.json")

2. Start a vLLM server using your target model:
```bash
-vllm serve
+vllm serve RedHatAI/Llama-3.1-8B-tldr
```

3. Run the benchmark with GuideLLM:

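The hunk header above references `tldr_1000.json`, the file created in step 1 of the README's benchmarking instructions. A minimal sketch of how such a file could be produced with the `datasets` library; taking the first 1,000 examples of the `trl-lib/tldr` test split is an assumption inferred from the filename and the evaluation section, not something shown in this diff:

```python
from datasets import load_dataset

# Assumption: 1,000 examples from the trl-lib/tldr test split, inferred from
# the filename "tldr_1000.json"; the README's actual step 1 may differ.
ds = load_dataset("trl-lib/tldr", split="test").select(range(1000))
ds.to_json("tldr_1000.json")
```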