---
base_model: meta-llama/Llama-3.2-1B-Instruct
datasets:
- JunxiongWang/sftdatasetv3
model-index:
- name: Zebra-Llama-1B-8MLA-8Mamba-SFT
  results: []
tags:
- alignment-handbook
- generated_from_trainer
license: apache-2.0
---

# Zebra-Llama: Towards Extremely Efficient Hybrid Models

Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that composes Multi-head Latent Attention (MLA) and Mamba2 layers for KV cache compression and computational efficiency.
This combination achieves Transformer-level accuracy with near-State Space Model (SSM) efficiency. While standard Transformers are limited by the quadratic complexity of self-attention and the large memory footprint of their key-value (KV) cache, Zebra-Llama offers a practical and scalable solution.

This model, `Zebra-Llama-1B-8MLA-8Mamba-SFT`, was created by efficiently adapting the pre-trained `Llama-3.2-1B-Instruct` model through post-training conducted on AMD Instinct&trade; MI300X GPUs. This approach bypasses the need for costly pre-training from scratch.

<div align="center">
<img src="comparison.png" width="570" height="380" style="object-fit: contain;"/>
<em><b>Figure 1:</b> Comparing 8B-scale models on average LM Harness score vs. KV cache size. Zebra-Llama (green) matches or exceeds baselines with smaller KV cache and fewer training tokens. Circle and square sizes indicate training tokens (billions for post-training, trillions for pre-training).</em>
</div>

## Key Takeaways
- Announcing Zebra-Llama, a family of highly efficient 1B, 3B, and 8B hybrid models created by post-training adaptation of existing state-of-the-art Transformers.
- Extreme KV Cache Compression: Zebra-Llama dramatically reduces the KV cache size to 2%-4% of the original Llama model while preserving 100% of its average zero-shot performance on LM Harness tasks.
- Efficient Hybrid Architecture: Zebra-Llama strategically combines Multi-head Latent Attention (MLA) layers, which compress the KV cache, and Mamba2 (SSM) layers, which eliminate the KV cache entirely, to balance memory usage and performance.
- Novel Post-Training Pipeline: Zebra-Llama employs an efficient post-training pipeline featuring refined weight initialization, Intermediate Layer Distillation (ILD) for knowledge transfer, and a sensitivity-aware strategy (SMART) for optimal hybrid composition.

## Model Composition Pipeline

The Zebra-Llama models are not trained from scratch. Instead, they are composed from powerful pre-trained Transformers through a lightweight and efficient pipeline. The creation of this model followed these stages:

| Stage | Action | Description |
|-------|--------|-------------|
| 1. Base Model | Llama-3.2-1B-Instruct | The starting point is a high-quality, pre-trained Transformer model. |
| 2. Initialization | Structured Weight Mapping | Pure Mamba2 and MLA models are initialized from the base model's weights using structured mapping techniques (SVD for MLA, reinterpretation for Mamba2). |
| 3. Refinement | Intermediate Layer Distillation (ILD) | The internal representations of the Mamba2 and MLA models are aligned with the base model's layers on a small dataset to ensure a strong starting point. |
| 4. Composition | SMART Layer Selection | A hybrid architecture is composed using the SMART (Sensitivity Measure-Aware Replacement of Transformer layers) strategy to optimally place each layer type. |
| 5. SFT | End-to-End Knowledge Distillation | The composed hybrid model is fine-tuned via knowledge distillation, using an 8B model as a teacher to transfer rich, pre-trained knowledge. |
| 6. Alignment | Direct Preference Optimization (DPO) | In the final stage, DPO is used to align the model's preferences, with the distilled student model itself serving as the reference model for stability. |
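
The two distillation stages above can be pictured with a minimal PyTorch sketch: ILD (stage 3) aligns hidden states with the frozen base model, and the SFT stage (stage 5) distills the teacher's token distribution. The exact objectives, layer matching, and loss weighting used for Zebra-Llama are defined in the paper and the AMD-Hybrid-Models repository; the functions below are illustrative stand-ins, not the project's implementation.

```python
import torch
import torch.nn.functional as F

def ild_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    # Stage 3 (illustrative): align the output of a freshly initialized MLA/Mamba2
    # layer with the output of the matching frozen Llama layer, here via plain MSE.
    return F.mse_loss(student_hidden, teacher_hidden)

def logit_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    # Stage 5 (illustrative): end-to-end distillation as a temperature-scaled KL
    # divergence between the teacher's and the hybrid student's token distributions.
    # Both logit tensors are expected flattened to (num_tokens, vocab_size).
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean", log_target=True) * temperature ** 2
```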

## Getting Started

### Installation

```bash
git clone https://github.com/AMD-AIG-AIMA/AMD-Hybrid-Models.git
```
Then follow the installation instructions in the `AMD-AIG-AIMA/AMD-Hybrid-Models` repository.

### Example Usage
Once the installation is complete, you can run the following code for a quick test:
```python
import torch
from transformers import AutoTokenizer
from hybrid_model.hybrid_model_wrapper import HybridTransformerHybridModelWrapper

checkpoint = "amd/Zebra-Llama-1B-8MLA-8Mamba-SFT"

# Load the hybrid checkpoint and its tokenizer
model = HybridTransformerHybridModelWrapper.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.eval()

# Format the prompt using the chat template
prompt = [{"role": "user", "content": "What are the benefits of hybrid language models?"}]
input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
).cuda()

# Generate a response
tokens = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```

### Model details

| Model | KV Size | Params | Index of MLA layers | r<sub>kv</sub> | r<sub>q</sub> | d<sub>rope</sub> | d<sub>nope</sub> |
|-------|--------:|-------:|--------------------:|------:|------:|---------:|---------:|
| Llama-3.2-1B-Instruct | 100% | 1.24B | - | - | - | - | - |
| Zebra-Llama-1B-8MLA-8M2 | 7.81% | 1.27B | [0,2,4,6,8,10,12,14] | 128 | 1344 | 32 | 32 |
| Zebra-Llama-1B-4MLA-12M2 | 3.91% | 1.28B | [0,5,10,14] | 128 | 1344 | 32 | 32 |
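
As a rough sanity check on the KV Size column, the per-token cache of the hybrid models can be compared against the Llama baseline. The sketch below assumes the public Llama-3.2-1B attention configuration (16 decoder layers, 8 KV heads of dimension 64) and counts cached values per token: each MLA layer stores a compressed latent of size r<sub>kv</sub> plus a decoupled RoPE key of size d<sub>rope</sub>, while Mamba2 layers keep only a constant-size state. This is illustrative arithmetic, not code from the repository.

```python
# Back-of-the-envelope check of the "KV Size" column (assumed Llama-3.2-1B config:
# 16 decoder layers, 8 KV heads x head dim 64 -> 2 * 8 * 64 = 1024 values/token/layer).
NUM_LAYERS = 16
BASELINE_PER_LAYER = 2 * 8 * 64  # cached keys + values for one GQA attention layer

def kv_cache_ratio(num_mla_layers: int, r_kv: int = 128, d_rope: int = 32) -> float:
    """Per-token KV cache of a hybrid model relative to the Llama baseline."""
    hybrid = num_mla_layers * (r_kv + d_rope)   # Mamba2 layers contribute nothing
    return hybrid / (NUM_LAYERS * BASELINE_PER_LAYER)

print(f"Zebra-Llama-1B-8MLA-8M2:  {kv_cache_ratio(8):.2%}")   # ~7.81%
print(f"Zebra-Llama-1B-4MLA-12M2: {kv_cache_ratio(4):.2%}")   # ~3.91%
```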

### Benchmark results
Zebra-Llama was evaluated on zero-shot tasks from the LM Evaluation Harness and compared against its base model and other post-training methods. The results demonstrate that Zebra-Llama provides a superior balance of performance and efficiency.

| Tasks | Metric | Llama-3.2-1B-Instruct | Zebra-Llama-1B-4MLA-12M2-SFT | Zebra-Llama-1B-4MLA-12M2-DPO | Zebra-Llama-1B-8MLA-8M2-SFT | Zebra-Llama-1B-8MLA-8M2-DPO |
|-------|--------|----------------------:|-----------------------------:|-----------------------------:|----------------------------:|----------------------------:|
| arc_challenge | acc | 0.3575 (±0.0140) | 0.3507 (±0.0139) | 0.3976 (±0.0143) | 0.3456 (±0.0139) | 0.3951 (±0.0143) |
| | acc_norm | 0.3797 (±0.0142) | 0.3908 (±0.0143) | 0.4232 (±0.0144) | 0.3797 (±0.0142) | 0.4249 (±0.0144) |
| arc_easy | acc | 0.6843 (±0.0095) | 0.7054 (±0.0094) | 0.7226 (±0.0092) | 0.7092 (±0.0093) | 0.7239 (±0.0092) |
| | acc_norm | 0.6351 (±0.0099) | 0.6536 (±0.0098) | 0.6696 (±0.0097) | 0.6641 (±0.0097) | 0.6726 (±0.0096) |
| hellaswag | acc | 0.4506 (±0.005) | 0.4272 (±0.0049) | 0.4399 (±0.005) | 0.4366 (±0.0049) | 0.4527 (±0.0050) |
| | acc_norm | 0.6077 (±0.0049) | 0.5691 (±0.0049) | 0.5893 (±0.0049) | 0.5816 (±0.0049) | 0.6061 (±0.0049) |
| mmlu | acc | 0.4609 (±0.0918) | 0.3739 (±0.0736) | 0.3791 (±0.0742) | 0.3940 (±0.0779) | 0.3909 (±0.0756) |
| - humanities | acc | 0.4397 (±0.0763) | 0.3456 (±0.0583) | 0.3443 (±0.0634) | 0.3694 (±0.0709) | 0.3700 (±0.0684) |
| - other | acc | 0.5204 (±0.0868) | 0.4184 (±0.0746) | 0.4081 (±0.0707) | 0.4300 (±0.0747) | 0.4258 (±0.0737) |
| - social_sciences | acc | 0.5109 (±0.0843) | 0.4098 (±0.0758) | 0.4303 (±0.0709) | 0.4348 (±0.0749) | 0.4283 (±0.0727) |
| - stem | acc | 0.3850 (±0.09) | 0.3375 (±0.0730) | 0.3527 (±0.077) | 0.3555 (±0.0776) | 0.3511 (±0.0746) |
| openbookqa | acc | 0.244 (±0.0192) | 0.2800 (±0.0201) | 0.302 (±0.0206) | 0.2480 (±0.0193) | 0.3000 (±0.0205) |
| | acc_norm | 0.35 (±0.0214) | 0.3700 (±0.0216) | 0.406 (±0.022) | 0.3800 (±0.0217) | 0.4180 (±0.0221) |
| piqa | acc | 0.7405 (±0.0102) | 0.7214 (±0.0105) | 0.7252 (±0.0104) | 0.7252 (±0.0104) | 0.7280 (±0.0104) |
| | acc_norm | 0.7437 (±0.0102) | 0.7225 (±0.0104) | 0.7296 (±0.0104) | 0.7269 (±0.0104) | 0.7296 (±0.0104) |
| pubmedqa | acc | 0.602 (±0.0219) | 0.5760 (±0.0221) | 0.566 (±0.0222) | 0.5940 (±0.0220) | 0.5860 (±0.0220) |
| race | acc | 0.3809 (±0.015) | 0.3445 (±0.0147) | 0.377 (±0.015) | 0.3694 (±0.0149) | 0.3866 (±0.0151) |
| winogrande | acc | 0.5967 (±0.0138) | 0.5785 (±0.0139) | 0.5888 (±0.0138) | 0.6125 (±0.0137) | 0.6133 (±0.0137) |
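
For reference, a minimal reproduction sketch using the LM Evaluation Harness (v0.4+ Python API) is shown below. It assumes `model` and `tokenizer` are the objects loaded in the usage example above and that the hybrid wrapper can be handed to the harness's Hugging Face interface as a preloaded model; the custom architecture may require additional adapter code, so treat this as a starting point rather than the exact evaluation setup used here.

```python
# Hypothetical zero-shot evaluation sketch with lm-evaluation-harness (v0.4+).
# Assumes `model` and `tokenizer` from the usage example; wrapping a custom hybrid
# model this way may need extra glue code for the harness.
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(
    model=lm,
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu", "openbookqa",
           "piqa", "pubmedqa", "race", "winogrande"],
    num_fewshot=0,  # zero-shot, as reported in the table above
)
print(results["results"])
```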

## Conclusion
Zebra-Llama demonstrates a practical and scalable framework for composing highly efficient hybrid models from existing pre-trained Transformers. By intelligently combining MLA and Mamba2 layers, this approach drastically reduces memory requirements and improves inference throughput while preserving the strong capabilities of the original model. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.

## Bias, Risks, and Limitations
- This model is a research artifact and has not been evaluated for safety in production use cases.
- The model's performance depends on the quality of its pre-trained base model and the teacher model used during distillation; its capabilities and biases are inherited from these sources.
- The model may generate content that is factually inaccurate, biased, or otherwise objectionable. Users should be aware of these risks and implement appropriate safeguards for their applications.
- One limitation of this work is the reliance on a strong teacher model for knowledge transfer, which may not always be available. Distillation from a teacher also adds to the resource requirements during the post-training phase.

## Citation
If you find this model useful, please consider citing the following papers:
```
@article{yang2025zebra,
  title={Zebra-Llama: Towards Extremely Efficient Hybrid Models},
  author={Yang, Mingyu and Rezagholizadeh, Mehdi and Li, Guihong and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2505.17272},
  year={2025}
}
@article{li2025x,
  title={X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression},
  author={Li, Guihong and Rezagholizadeh, Mehdi and Yang, Mingyu and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2503.11132},
  year={2025}
}
```