Update README.md

ghl committed · Commit c483dc3 · 1 Parent(s): 3e1f068
Files changed (1)
  1. README.md +23 -22
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- base_model: meta-llama/Llama-3.2-3B-Instruct
+ base_model:
+ - meta-llama/Llama-3.2-3B-Instruct
  datasets:
  - JunxiongWang/sftdatasetv3
  model-index:
@@ -89,31 +90,31 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=False))
  | Model | KV Size | Param | Index of MLA layers | r<sub>kv</sub>| r<sub>q</sub> | d<sub>rope</sub> | d<sub>nope</sub> |
  |-------|--------:|------:|-------------------:|------:|------:|---------:|---------:|
  |Llama-3.2-3B-Instruct | 100% | 3.21B | - | -| -| -| -|
- |Zebra-Llama-3B-14MLA-14M2 | 4.69% | 3.27B | [0,2,4,6,8,10,12,14,16,18,20,22,24,26] | 128 | 1536 | 64 | 64 |
- |Zebra-Llama-3B-6MLA-22M2 | 2.01% | 3.39B | [0,5,10,16,21,26] | 128 | 1536 | 64 | 64 |
+ |Zebra-Llama-3B-6MLA-22Mamba | 2.01% | 3.39B | [0,5,10,16,21,26] | 128 | 1536 | 64 | 64 |
+ |Zebra-Llama-3B-14MLA-14Mamba | 4.69% | 3.27B | [0,2,4,6,8,10,12,14,16,18,20,22,24,26] | 128 | 1536 | 64 | 64 |

  ### Benchmark results
  Zebra-Llama was evaluated on the Language Model Harness benchmark for zero-shot tasks and compared against its base model and other post-training methods. The results demonstrate that Zebra-Llama provides a superior balance of performance and efficiency.
- | Tasks | Metric | Llama-3.2-3B-Instruct | Zebra-Llama-3B-6MLA-22M2-SFT | Zebra-Llama-3B-6MLA-22M2-DPO | Zebra-Llama-3B-14MLA-14M2-SFT | Zebra-Llama-3B-14MLA-14M2-DPO |
+ | Tasks | Metric | Llama-3.2-3B-Instruct | Zebra-Llama-3B-6MLA-22M2-SFT | Zebra-Llama-3B-6MLA-22M2-DPO | Zebra-Llama-3B-14MLA-14M2-SFT | Zebra-Llama-3B-14MLA-14M2-DPO |
  |-------------------|----------|------------------: |----: |----:| ----:|----:|
- | arc_challenge | acc | 0.4369±0.0145 | 0.3507 (±0.0139) | 0.3976 (±0.0143)| 0.3456 (±0.0139) | 0.3951 (±0.0143)|
- | | acc_norm | 0.4590±0.0146 | 0.3908 (±0.0143) | 0.4232 (±0.0144)| 0.3797 (±0.0142)| 0.4249 (±0.0144)|
- | arc_easy | acc | 0.7428±0.0090 | 0.7054 (±0.0094) | 0.7226 (±0.0092)| 0.7092 (±0.0093)| 0.7239 (±0.0092)|
- | | acc_norm | 0.6776±0.0096 | 0.6536 (±0.0098) | 0.6696 (±0.0097)| 0.6641 (±0.0097)| 0.6726 (±0.0096)|
- | hellaswag | acc | 0.5222±0.0050 | 0.4272 (±0.0049) | 0.4399 (±0.005) | 0.4366 (±0.0049)| 0.4527 (±0.0050)|
- | | acc_norm | 0.7036±0.0046 | 0.5691 (±0.0049) | 0.5893 (±0.0049) | 0.5816 (±0.0049)| 0.6061 (±0.0049)|
- | mmlu | acc | 0.6046±0.1057 | 0.3739 (±0.0736) | 0.3791 (±0.0742)| 0.3940 (±0.0779)| 0.3909 (±0.0756)|
- | - humanities | acc | 0.5926±0.0826 | 0.3456 (±0.0583) | 0.3443 (±0.0634)| 0.3694 (±0.0709)| 0.3700 (±0.0684)|
- | - other | acc | 0.6598±0.1118 | 0.4184 (±0.0746) | 0.4081 (±0.0707)| 0.4300 (±0.0747)| 0.4258 (±0.0737)|
- | - social_sciences | acc | 0.6701±0.0712 | 0.4098 (±0.0758) | 0.4303 (±0.0709)| 0.4348 (±0.0749)| 0.4283 (±0.0727)|
- | - stem | acc | 0.5043±0.1122 | 0.3375 (±0.0730) | 0.3527 (±0.077)| 0.3555 (±0.0776)| 0.3511 (±0.0746)|
- | openbookqa | acc | 0.2740±0.0200 | 0.2800 (±0.0201) | 0.302 (±0.0206)| 0.2480 (±0.0193)| 0.3000 (±0.0205)|
- | | acc_norm | 0.3620±0.021 | 0.3700 (±0.0216) | 0.406 (±0.022)| 0.3800 (±0.0217)| 0.4180 (±0.0221)|
- | piqa | acc | 0.7606±0.0100| 0.7214 (±0.0105) | 0.7252 (±0.0104)| 0.7252 (±0.0104)| 0.7280 (±0.0104)|
- | | acc_norm | 0.7557±0.0100 | 0.7225 (±0.0104) | 0.7296 (±0.0104)| 0.7269 (±0.0104)| 0.7296 (±0.0104)|
- | pubmedqa | acc | 0.6960±0.0206 | 0.5760 (±0.0221) | 0.566 (±0.0222)| 0.5940 (±0.0220)| 0.5860 (±0.0220)|
- | race | acc | 0.4077±0.0152| 0.3445 (±0.0147) | 0.377 (±0.015)| 0.3694 (±0.0149)| 0.3866 (±0.0151)|
- | winogrande | acc | 0.6717±0.0132 | 0.5785 (±0.0139) | 0.5888 (±0.0138)| 0.6125 (±0.0137)| 0.6133 (±0.0137)|
+ | arc_challenge | acc | 0.4369±0.0145 | 0.4189±0.0144 | 0.4744±0.0146 | 0.4326±0.0145 | 0.4966±0.0146 |
+ | | acc_norm | 0.459±0.0146 | 0.4539±0.0145 | 0.5077±0.0146 | 0.4667±0.0146 | 0.5128±0.0146 |
+ | arc_easy | acc | 0.7428±0.009 | 0.7677±0.0087 | 0.7984±0.0082 | 0.771±0.0086 | 0.7959±0.0083 |
+ | | acc_norm | 0.6776±0.0096 | 0.7205±0.0092 | 0.7609±0.0088 | 0.7269±0.0091 | 0.7614±0.0087 |
+ | hellaswag | acc | 0.5222±0.005 | 0.5014±0.005 | 0.5213±0.005 | 0.5068±0.005 | 0.531±0.005 |
+ | | acc_norm | 0.7036±0.0046 | 0.6812±0.0047 | 0.7146±0.0045 | 0.6875±0.0046 | 0.7257±0.0045 |
+ | mmlu | acc | 0.6046±0.1057 | 0.5049±0.103 | 0.5006±0.1038 | 0.528±0.1069 | 0.521±0.1057 |
+ | - humanities | acc | 0.5926±0.0826 | 0.4553±0.1006 | 0.4493±0.0948 | 0.4759±0.1027 | 0.4721±0.1008 |
+ | - other | acc | 0.6598±0.1118 | 0.5626±0.0935 | 0.5603±0.0971 | 0.5919±0.0948 | 0.5855±0.0916 |
+ | - social_sciences | acc | 0.6701±0.0712 | 0.5899±0.083 | 0.584±0.0871 | 0.6113±0.0848 | 0.6081±0.081 |
+ | - stem | acc | 0.5043±0.1122 | 0.4393±0.0894 | 0.437±0.0959 | 0.4615±0.1003 | 0.4456±0.0987 |
+ | openbookqa | acc | 0.274±0.02 | 0.278±0.0201 | 0.324±0.021 | 0.282±0.0201 | 0.33±0.021 |
+ | | acc_norm | 0.362±0.0215 | 0.386±0.0218 | 0.434±0.0222 | 0.394±0.0219 | 0.424±0.0221 |
+ | piqa | acc | 0.7606±0.01 | 0.753±0.0101 | 0.772±0.0098 | 0.7617±0.0099 | 0.7775±0.0097 |
+ | | acc_norm | 0.7557±0.01 | 0.7573±0.01 | 0.7726±0.0098 | 0.7579±0.01 | 0.7753±0.0097 |
+ | pubmedqa | acc | 0.696±0.0206 | 0.612±0.0218 | 0.62±0.0217 | 0.648±0.0214 | 0.63±0.0216 |
+ | race | acc | 0.4077±0.0152 | 0.3904±0.0151 | 0.4249±0.0153 | 0.4048±0.0152 | 0.4593±0.0154 |
+ | winogrande | acc | 0.6717±0.0132 | 0.6614±0.0133 | 0.6646±0.0133 | 0.6598±0.0133 | 0.6756±0.0132 |

  ## Conclusion
  Zebra-Llama demonstrates a practical and scalable framework for composing highly efficient hybrid models from existing pre-trained Transformers. By intelligently combining MLA and Mamba2 layers, this approach drastically reduces memory requirements and improves inference throughput while preserving the strong capabilities of the original model. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.
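
The KV Size column in the architecture table above can be sanity-checked with a short back-of-the-envelope calculation. The sketch below is not taken from the repository's code; it assumes Llama-3.2-3B-Instruct's published configuration (28 layers, 8 KV heads, head dimension 128) and assumes each MLA layer caches only its compressed KV latent (r<sub>kv</sub>) plus the decoupled RoPE key (d<sub>rope</sub>) per token, while Mamba2 layers hold a fixed-size state that does not grow with sequence length. Under those assumptions the reported 4.69% and 2.01% ratios fall out directly.

```python
# Back-of-the-envelope check of the "KV Size" column (assumptions, not the repo's code):
#   * Llama-3.2-3B-Instruct: 28 layers, 8 KV heads, head_dim 128, K and V both cached.
#   * Each MLA layer caches r_kv latent elements + d_rope RoPE-key elements per token.
#   * Mamba2 layers keep a constant-size state, so they add no per-token cache.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128
R_KV, D_ROPE = 128, 64

def kv_ratio(num_mla_layers: int) -> float:
    """Per-token cache of the hybrid model relative to the original Llama."""
    baseline = N_LAYERS * 2 * N_KV_HEADS * HEAD_DIM   # K and V for every layer
    hybrid = num_mla_layers * (R_KV + D_ROPE)         # latent + RoPE key per MLA layer
    return hybrid / baseline

print(f"14 MLA + 14 Mamba2: {kv_ratio(14):.2%}")  # ~4.69%
print(f" 6 MLA + 22 Mamba2: {kv_ratio(6):.2%}")   # ~2.01%
```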
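The zero-shot numbers in the benchmark table come from the LM Evaluation Harness. As a rough guide to reproducing such a run, the sketch below uses the harness's Python API (lm-eval ≥ 0.4); the checkpoint id is a placeholder rather than the actual repository name, and options such as trust_remote_code may or may not be needed depending on how the hybrid layers are packaged.

```python
# Minimal sketch (not the authors' evaluation script): zero-shot evaluation with
# EleutherAI's lm-eval (v0.4+). The checkpoint id below is a placeholder.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<org>/<zebra-llama-checkpoint>,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu",
           "openbookqa", "piqa", "pubmedqa", "race", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task accuracies and standard errors are reported under the "results" key.
print(json.dumps(results["results"], indent=2, default=str))
```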