---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# SwiftKV 

The Snowflake AI Research team is releasing a series of SwiftKV-optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a set of inference optimizations that goes beyond traditional key-value (KV) cache compression. It reduces computational overhead during prompt processing by combining model rewiring with knowledge-preserving self-distillation, allowing prefill tokens to skip up to half the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
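As a rough illustration of where the "up to 2x" figure can come from, the sketch below counts how many layers a prompt token still passes through when a given fraction of layers is skipped. It is a back-of-envelope estimate only, not the SwiftKV implementation: the function names are hypothetical, the 32-layer count is taken from the public Llama-3.1-8B configuration, and the estimate assumes every layer costs the same and ignores any work still done for the skipped layers as well as decode time.

```python
def prefill_layers_run(num_layers: int, skip_fraction: float) -> int:
    """Layers a prompt token still passes through (hypothetical helper)."""
    if not 0.0 <= skip_fraction < 1.0:
        raise ValueError("skip_fraction must be in [0, 1)")
    skipped = int(num_layers * skip_fraction)
    return num_layers - skipped


def ideal_prefill_speedup(num_layers: int, skip_fraction: float) -> float:
    """Upper bound on prefill speedup, assuming uniform per-layer cost."""
    return num_layers / prefill_layers_run(num_layers, skip_fraction)


# Assumption: Llama-3.1-8B has 32 transformer layers; 0.5 mirrors
# "skip up to half the model's layers" from the description above.
NUM_LAYERS = 32
SKIP_FRACTION = 0.5

print(prefill_layers_run(NUM_LAYERS, SKIP_FRACTION))
print(ideal_prefill_speedup(NUM_LAYERS, SKIP_FRACTION))
```

Measured gains depend on input length, hardware, and the serving stack, so this bound should be read as an idealized ceiling rather than a prediction.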

For more details about SwiftKV and how to use it: 
* โ„๏ธ [SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction (blog)](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)
* ๐Ÿ“ [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
* ๐Ÿš€ [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)

## Performance Metrics

Combined input and output throughput for Llama 3.1 405B across a range of input lengths.
<img src="figure-4.png" alt="performance plot of llama-405B w. swiftkv" width="400">
Legend: blue - baseline FP8, pink - SwiftKV FP8<br>



## Eval Metrics

For a full breakdown of evaluation metrics and performance impact, please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper](https://arxiv.org/abs/2410.03960). Some of the most relevant evaluation metrics are outlined below.

| Llama-3.1-405B-Instruct-FP8 | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
|-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
| Baseline | 94.7 | 87.0 | 88.3 | 64.7 | 87.5 | 88.1 | 96.1 | **86.6** |
| 50% SingleInputKV | 94.0 | 86.3 | 88.1 | 64.2 | 85.7 | 87.5 | 95.2 | **85.9** |

| Llama-3.1-8B-Instruct | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
|-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
| Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
| 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
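
The `Avg` column can be reproduced from the per-benchmark scores. The snippet below is a small sketch that assumes `Avg` is the unweighted mean of the seven benchmark columns; the variable and function names are illustrative, and the numbers are copied from the tables above.

```python
# Scores copied from the tables above, in column order:
# Arc Challenge, Winogrande, HellaSwag, TruthfulQA, MMLU, MMLU cot, GSM8K
SCORES = {
    "Llama-3.1-405B-Instruct-FP8": {
        "Baseline": [94.7, 87.0, 88.3, 64.7, 87.5, 88.1, 96.1],
        "50% SingleInputKV": [94.0, 86.3, 88.1, 64.2, 85.7, 87.5, 95.2],
    },
    "Llama-3.1-8B-Instruct": {
        "Baseline": [82.00, 77.90, 80.40, 54.56, 67.90, 70.63, 82.56],
        "50% SingleInputKV": [80.38, 78.22, 79.30, 54.54, 67.30, 69.73, 79.45],
    },
}


def average(scores: list[float]) -> float:
    """Unweighted mean across benchmarks (assumed definition of Avg)."""
    return sum(scores) / len(scores)


def average_gap(model: str) -> float:
    """Baseline average minus SwiftKV average, in points."""
    rows = SCORES[model]
    return average(rows["Baseline"]) - average(rows["50% SingleInputKV"])


for model, rows in SCORES.items():
    for variant, scores in rows.items():
        print(f"{model} | {variant}: {average(scores):.2f}")
    print(f"{model} | gap: {average_gap(model):.2f} points")
```

Comparing the two averages per model gives a single-number view of the accuracy cost of the 50% SingleInputKV configuration.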

## Getting Started

Instructions on how to use vLLM for both evaluation and performance benchmarks:
https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv