Mingyuyang-1 committed · Commit 3c6f8f4 · verified · 1 Parent(s): c231932

Update README.md

Files changed (1):
  1. README.md +7 -4
README.md CHANGED
@@ -12,12 +12,15 @@ license: apache-2.0
  ---

  # Zebra-Llama: Towards Extremely Efficient Hybrid Models
- Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that achieve Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.
+ Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that composes Multi-head Latent Attention (MLA) and Mamba2 for KV cache compression and computational efficiency.
+ This combination achieves Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.

- This model, `Zebra-Llama-1B-4MLA-12M2`, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model. It composes efficient hybrid layers, combining Multi-head Latent Attention (MLA) for KV cache compression and Mamba2 (an SSM) for computational efficiency. This approach bypasses the need for costly pre-training from scratch.
-
- The composition follows a three-stage pipeline to effectively transfer knowledge from the pre-trained Transformer.
+ This model, `Zebra-Llama-1B-4MLA-12M2`, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model through post-training on AMD Instinct™ MI300X GPUs. This training approach bypasses the need for costly pre-training from scratch.

+ <div align="center">
+ <img src="scaling_perf_instruct.png" style="object-fit: contain;"/>
+ <em><b>Figure 1:</b> Pareto frontier of pre-training tokens vs average performance for pre-trained and instruction-tuned models.</em>
+ </div>

  ## Key Takeaways
  - Announcing Zebra-Llama, a family of highly efficient 1B, 3B, and 8B hybrid models created by post-training adaptation of existing state-of-the-art Transformers.
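For readers who want to try the adapted checkpoint, the sketch below shows one way it might be loaded and prompted. This is a minimal, hypothetical example and is not part of the commit: the repo id `amd/Zebra-Llama-1B-4MLA-12M2`, the use of `trust_remote_code=True` for the custom hybrid MLA/Mamba2 layers, and the standard `transformers` chat-template path are all assumptions; if the model card ships its own usage section, follow that instead.

```python
# Hypothetical usage sketch (not from the diff). Assumes the checkpoint is published as
# "amd/Zebra-Llama-1B-4MLA-12M2" and loads through the standard transformers causal-LM path;
# the hybrid MLA + Mamba2 layers may require custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Zebra-Llama-1B-4MLA-12M2"  # assumed repo id under the amd organization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduced memory footprint, in line with the model's efficiency focus
    device_map="auto",
    trust_remote_code=True,      # assumption: custom hybrid layers ship with the repo
)

# Chat-style prompt, since the base model is Llama-3.2-1B-Instruct.
messages = [{"role": "user", "content": "Summarize what a hybrid MLA + Mamba2 model is in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64)

# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```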