Mingyuyang-1 committed · Commit 3c6f8f4 · verified · 1 Parent(s): c231932

Update README.md

Files changed (1):
  1. README.md +7 -4
README.md CHANGED
@@ -12,12 +12,15 @@ license: apache-2.0
  ---

  # Zebra-Llama: Towards Extremely Efficient Hybrid Models
- Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that achieve Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.
+ Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that composes Multi-head Latent Attention (MLA) and Mamba2 for KV cache compression and computational efficiency.
+ This combination achieves Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.

- This model, `Zebra-Llama-1B-4MLA-12M2`, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model. It composes efficient hybrid layers, combining Multi-head Latent Attention (MLA) for KV cache compression and Mamba2 (an SSM) for computational efficiency. This approach bypasses the need for costly pre-training from scratch.
-
- The composition follows a three-stage pipeline to effectively transfer knowledge from the pre-trained Transformer.
+ This model, `Zebra-Llama-1B-4MLA-12M2`, is created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model through post-training on AMD Instinct™ MI300X GPUs. This training approach bypasses the need for costly pre-training from scratch.

+ <div align="center">
+ <img src="scaling_perf_instruct.png" style="object-fit: contain;"/>
+ <em><b>Figure 1:</b> Pareto frontier of pre-training tokens vs average performance for pre-trained and instruction-tuned models.</em>
+ </div>

  ## Key Takeaways
  - Announcing Zebra-Llama, a family of highly efficient 1B, 3B, and 8B hybrid models created by post-training adaptation of existing state-of-the-art Transformers.
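For readers who want to try the adapted checkpoint, the sketch below shows one way it might be loaded and prompted. This is a minimal, hypothetical example and is not part of the commit: the repo id `amd/Zebra-Llama-1B-4MLA-12M2`, the use of `trust_remote_code=True` for the custom hybrid MLA/Mamba2 layers, and the standard `transformers` chat-template path are all assumptions; if the model card ships its own usage section, follow that instead.

```python
# Hypothetical usage sketch (not from the diff). Assumes the checkpoint is published as
# "amd/Zebra-Llama-1B-4MLA-12M2" and loads through the standard transformers causal-LM path;
# the hybrid MLA + Mamba2 layers may require custom modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/Zebra-Llama-1B-4MLA-12M2"  # assumed repo id under the amd organization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduced memory footprint, in line with the model's efficiency focus
    device_map="auto",
    trust_remote_code=True,      # assumption: custom hybrid layers ship with the repo
)

# Chat-style prompt, since the base model is Llama-3.2-1B-Instruct.
messages = [{"role": "user", "content": "Summarize what a hybrid MLA + Mamba2 model is in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64)

# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```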