daniicruzz committed · Commit e7350ce · verified · 1 Parent(s): 91ed35d

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +7 -7
README.md CHANGED
@@ -251,12 +251,12 @@ BFCL V4 is the de facto industry standard for evaluating function-calling (tool-
  ### Quantitative Results (Inference Performance)
 
  #### Metrics reported
- - **System Output Throughput:** Mean output tokens per second across all concurrent requests over the benchmarking phase.
- - **End-to-End Latency per Query:** Median end-to-end response time for each query from the time the query is sent.
- - **Output Speed per Query:** Median output tokens per second after the first token is received for each query.
- - **Time to first token (TTFT):** Median time from when the query is sent until the first token is received.
- - **Estimated Peak Memory Usage:** KV cache utilization is monitored during the benchmarking phase and memory usage is estimated as $model\_weights_{gb} + kv\_cache_{usage\_pct} \times (nvml\_used_{gb} - model\_weights_{gb})$.
- - **Model weights:**
+ - **System Output Throughput (higher is better):** Mean output tokens per second across all concurrent requests over the benchmarking phase.
+ - **End-to-End Latency per Query (lower is better):** Median end-to-end response time for each query from the time the query is sent.
+ - **Output Speed per Query (higher is better):** Median output tokens per second after the first token is received for each query.
+ - **Time to first token (TTFT) (lower is better):** Median time from when the query is sent until the first token is received.
+ - **Estimated Peak Memory Usage (lower is better):** KV cache utilization is monitored during the benchmarking phase and memory usage is estimated as $model\_weights_{gb} + kv\_cache_{usage\_pct} \times (nvml\_used_{gb} - model\_weights_{gb})$.
+ - **Model weights (lower is better):**
 
@@ -273,7 +273,7 @@ Our performance evaluation follows the spirit of [Artificial Analysis](https://a
  - **Streaming**: Benchmarking is conducted with streaming enabled.
 
- **Summary of improvements:** Little Lamb shows a slight improvement in performance with respect to the original Qwen model. This is expected since, for such small models, VRAM usage is dominated by the KV cache rather than the model weights.
+ **Summary of improvements:** LittleLamb shows a slight improvement in performance with respect to the original Qwen model. This is expected since, for such small models, VRAM usage is dominated by the KV cache rather than the model weights.
 
  ![Performance](assets/littlelamb-tc-performance-family.png)
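The peak-memory estimate in the metrics list above is a simple linear formula. As a minimal sketch (the function name and the example numbers are hypothetical, not from the benchmark):

```python
def estimated_peak_memory_gb(model_weights_gb: float,
                             kv_cache_usage_pct: float,
                             nvml_used_gb: float) -> float:
    """Estimate peak memory as the model weights plus the KV-cache-utilized
    share of the remaining NVML-reported GPU memory usage."""
    return model_weights_gb + kv_cache_usage_pct * (nvml_used_gb - model_weights_gb)

# Hypothetical values: 1 GB of weights, 50% peak KV cache utilization,
# 5 GB total usage reported by NVML.
print(estimated_peak_memory_gb(1.0, 0.5, 5.0))  # → 3.0
```

The intuition behind the formula is that NVML reports the full allocation (weights plus the pre-allocated KV cache pool), so scaling the non-weight portion by the observed cache utilization approximates how much of that pool was actually needed.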