daniicruzz committed on
Commit 6060448 · verified · 1 Parent(s): 2c5b9ff

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +9 -8
README.md CHANGED
@@ -184,13 +184,14 @@ Reported numbers use the methodology described above.
 ### Quantitative Results (Inference Performance)

 #### Metrics reported
- - **System Output Throughput**: Mean output tokens per second across all concurrent requests over the benchmarking phase.
- - **End-to-End Latency per Query:** Median end-to-end response time for each query from the time the query is sent.
- - **Output Speed per Query:** Median output tokens per second after the first token is received for each query.
- - **Time to first token (TTFT):** Median
- - **Estimated Peak Memory Usage:** KV cache utilization is monitored during the phase and we estimate memory usage as follows: $model\_weights_{gb} + kv\_cache_{usage\_pct} \times (nvml\_used_{gb} - model\_weights_{gb})$
- - **Model weights:**
- **Summary of improvements:** Little Lamb shows a slight improvement in performance with respect to the original Qwen Model. This is expected as for such small models, VRAM usage is dominated by KV cache and not model weights.

 #### Performance evaluation conditions

@@ -205,7 +206,7 @@ Our performance evaluation follows the spirit of [Artificial Analysis](https://a
 - **Streaming**: Benchmarking is conducted with streaming enabled.

- **Summary of improvements:** Little Lamb shows a slight improvement in performance with respect to the original Qwen Model. This is expected as for such small models, VRAM usage is dominated by KV cache and not model weights.

 ![Performance](assets/littlelamb-performance-family.png)

 ### Quantitative Results (Inference Performance)

 #### Metrics reported
+ - **System Output Throughput (higher is better)**: Mean output tokens per second across all concurrent requests over the benchmarking phase.
+ - **End-to-End Latency per Query (lower is better):** Median end-to-end response time for each query, measured from the time the query is sent.
+ - **Output Speed per Query (higher is better):** Median output tokens per second after the first token is received for each query.
+ - **Time to first token (TTFT) (lower is better):** Median time from when the query is sent until the first output token is received.
+ - **Estimated Peak Memory Usage (lower is better):** KV cache utilization is monitored during the benchmarking phase and memory usage is estimated as: $model\_weights_{gb} + kv\_cache_{usage\_pct} \times (nvml\_used_{gb} - model\_weights_{gb})$
+ - **Model weights (lower is better):** Size of the loaded model weights in GB ($model\_weights_{gb}$ in the estimate above).
+
+ **Summary of improvements:** LittleLamb shows a slight improvement in performance relative to the original Qwen model. This is expected: for such small models, VRAM usage is dominated by the KV cache rather than the model weights.

 #### Performance evaluation conditions

 - **Streaming**: Benchmarking is conducted with streaming enabled.

 ![Performance](assets/littlelamb-performance-family.png)
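The peak-memory estimate in the metrics list is simple arithmetic over three measured quantities. A minimal Python sketch follows; the function name and the sample numbers are illustrative assumptions, not values from the README's benchmarks:

```python
# Sketch of the estimated-peak-memory formula from the metrics list:
# model_weights_gb + kv_cache_usage_pct * (nvml_used_gb - model_weights_gb)
# All names and example values below are hypothetical.

def estimate_peak_memory_gb(model_weights_gb: float,
                            kv_cache_usage_pct: float,
                            nvml_used_gb: float) -> float:
    """Weights plus the KV-cache share of the remaining NVML-reported usage.

    kv_cache_usage_pct is a fraction in [0, 1] (peak KV-cache utilization).
    """
    return model_weights_gb + kv_cache_usage_pct * (nvml_used_gb - model_weights_gb)

# Example: 1.2 GB of weights, 40% peak KV-cache utilization,
# 9.2 GB total device memory reported by NVML during the run.
print(round(estimate_peak_memory_gb(1.2, 0.40, 9.2), 2))  # 4.4
```

With a small model the first term is small, so the estimate is dominated by the KV-cache term, which is consistent with the summary's claim that VRAM usage is dominated by KV cache rather than weights.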