Update README.md

ghl committed · Commit c483dc3 · 1 Parent(s): 3e1f068
Files changed (1)
  1. README.md +23 -22
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
- base_model: meta-llama/Llama-3.2-3B-Instruct
+ base_model:
+ - meta-llama/Llama-3.2-3B-Instruct
  datasets:
  - JunxiongWang/sftdatasetv3
  model-index:
@@ -89,31 +90,31 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=False))
  | Model | KV Size | Param | Index of MLA layers | r<sub>kv</sub>| r<sub>q</sub> | d<sub>rope</sub> | d<sub>nope</sub> |
  |-------|--------:|------:|-------------------:|------:|------:|---------:|---------:|
  |Llama-3.2-3B-Instruct | 100% | 3.21B | - | -| -| -| -|
- |Zebra-Llama-3B-14MLA-14M2 | 4.69% | 3.27B | [0,2,4,6,8,10,12,14,16,18,20,22,24,26] | 128 | 1536 | 64 | 64 |
- |Zebra-Llama-3B-6MLA-22M2 | 2.01% | 3.39B | [0,5,10,16,21,26] | 128 | 1536 | 64 | 64 |
+ |Zebra-Llama-3B-6MLA-22Mamba | 2.01% | 3.39B | [0,5,10,16,21,26] | 128 | 1536 | 64 | 64 |
+ |Zebra-Llama-3B-14MLA-14Mamba | 4.69% | 3.27B | [0,2,4,6,8,10,12,14,16,18,20,22,24,26] | 128 | 1536 | 64 | 64 |

  ### Benchmark results
  Zebra-Llama was evaluated on the Language Model Harness benchmark for zero-shot tasks and compared against its base model and other post-training methods. The results demonstrate that Zebra-Llama provides a superior balance of performance and efficiency.
- | Tasks | Metric | Llama-3.2-3B-Instruct | Zebra-Llama-3B-6MLA-22M2-SFT | Zebra-Llama-3B-6MLA-22M2-DPO | Zebra-Llama-3B-14MLA-14M2-SFT | Zebra-Llama-3B-14MLA-14M2-DPO |
+ | Tasks | Metric | Llama-3.2-3B-Instruct | Zebra-Llama-3B-6MLA-22M2-SFT | Zebra-Llama-3B-6MLA-22M2-DPO | Zebra-Llama-3B-14MLA-14M2-SFT | Zebra-Llama-3B-14MLA-14M2-DPO |
  |-------------------|----------|------------------: |----: |----:| ----:|----:|
- | arc_challenge | acc | 0.4369±0.0145 | 0.3507 (±0.0139) | 0.3976 (±0.0143)| 0.3456 (±0.0139) | 0.3951 (±0.0143)|
- | | acc_norm | 0.4590±0.0146 | 0.3908 (±0.0143) | 0.4232 (±0.0144)| 0.3797 (±0.0142)| 0.4249 (±0.0144)|
- | arc_easy | acc | 0.7428±0.0090 | 0.7054 (±0.0094) | 0.7226 (±0.0092)| 0.7092 (±0.0093)| 0.7239 (±0.0092)|
- | | acc_norm | 0.6776±0.0096 | 0.6536 (±0.0098) | 0.6696 (±0.0097)| 0.6641 (±0.0097)| 0.6726 (±0.0096)|
- | hellaswag | acc | 0.5222±0.0050 | 0.4272 (±0.0049) | 0.4399 (±0.005) | 0.4366 (±0.0049)| 0.4527 (±0.0050)|
- | | acc_norm | 0.7036±0.0046 | 0.5691 (±0.0049) | 0.5893 (±0.0049) | 0.5816 (±0.0049)| 0.6061 (±0.0049)|
- | mmlu | acc | 0.6046±0.1057 | 0.3739 (±0.0736) | 0.3791 (±0.0742)| 0.3940 (±0.0779)| 0.3909 (±0.0756)|
- | - humanities | acc | 0.5926±0.0826 | 0.3456 (±0.0583) | 0.3443 (±0.0634)| 0.3694 (±0.0709)| 0.3700 (±0.0684)|
- | - other | acc | 0.6598±0.1118 | 0.4184 (±0.0746) | 0.4081 (±0.0707)| 0.4300 (±0.0747)| 0.4258 (±0.0737)|
- | - social_sciences | acc | 0.6701±0.0712 | 0.4098 (±0.0758) | 0.4303 (±0.0709)| 0.4348 (±0.0749)| 0.4283 (±0.0727)|
- | - stem | acc | 0.5043±0.1122 | 0.3375 (±0.0730) | 0.3527 (±0.077)| 0.3555 (±0.0776)| 0.3511 (±0.0746)|
- | openbookqa | acc | 0.2740±0.0200 | 0.2800 (±0.0201) | 0.302 (±0.0206)| 0.2480 (±0.0193)| 0.3000 (±0.0205)|
- | | acc_norm | 0.3620±0.021 | 0.3700 (±0.0216) | 0.406 (±0.022)| 0.3800 (±0.0217)| 0.4180 (±0.0221)|
- | piqa | acc | 0.7606±0.0100| 0.7214 (±0.0105) | 0.7252 (±0.0104)| 0.7252 (±0.0104)| 0.7280 (±0.0104)|
- | | acc_norm | 0.7557±0.0100 | 0.7225 (±0.0104) | 0.7296 (±0.0104)| 0.7269 (±0.0104)| 0.7296 (±0.0104)|
- | pubmedqa | acc | 0.6960±0.0206 | 0.5760 (±0.0221) | 0.566 (±0.0222)| 0.5940 (±0.0220)| 0.5860 (±0.0220)|
- | race | acc | 0.4077±0.0152| 0.3445 (±0.0147) | 0.377 (±0.015)| 0.3694 (±0.0149)| 0.3866 (±0.0151)|
- | winogrande | acc | 0.6717±0.0132 | 0.5785 (±0.0139) | 0.5888 (±0.0138)| 0.6125 (±0.0137)| 0.6133 (±0.0137)|
+ | arc_challenge | acc | 0.4369±0.0145 | 0.4189±0.0144 | 0.4744±0.0146 | 0.4326±0.0145 | 0.4966±0.0146 |
+ | | acc_norm | 0.459±0.0146 | 0.4539±0.0145 | 0.5077±0.0146 | 0.4667±0.0146 | 0.5128±0.0146 |
+ | arc_easy | acc | 0.7428±0.009 | 0.7677±0.0087 | 0.7984±0.0082 | 0.771±0.0086 | 0.7959±0.0083 |
+ | | acc_norm | 0.6776±0.0096 | 0.7205±0.0092 | 0.7609±0.0088 | 0.7269±0.0091 | 0.7614±0.0087 |
+ | hellaswag | acc | 0.5222±0.005 | 0.5014±0.005 | 0.5213±0.005 | 0.5068±0.005 | 0.531±0.005 |
+ | | acc_norm | 0.7036±0.0046 | 0.6812±0.0047 | 0.7146±0.0045 | 0.6875±0.0046 | 0.7257±0.0045 |
+ | mmlu | acc | 0.6046±0.1057 | 0.5049±0.103 | 0.5006±0.1038 | 0.528±0.1069 | 0.521±0.1057 |
+ | - humanities | acc | 0.5926±0.0826 | 0.4553±0.1006 | 0.4493±0.0948 | 0.4759±0.1027 | 0.4721±0.1008 |
+ | - other | acc | 0.6598±0.1118 | 0.5626±0.0935 | 0.5603±0.0971 | 0.5919±0.0948 | 0.5855±0.0916 |
+ | - social_sciences | acc | 0.6701±0.0712 | 0.5899±0.083 | 0.584±0.0871 | 0.6113±0.0848 | 0.6081±0.081 |
+ | - stem | acc | 0.5043±0.1122 | 0.4393±0.0894 | 0.437±0.0959 | 0.4615±0.1003 | 0.4456±0.0987 |
+ | openbookqa | acc | 0.274±0.02 | 0.278±0.0201 | 0.324±0.021 | 0.282±0.0201 | 0.33±0.021 |
+ | | acc_norm | 0.362±0.0215 | 0.386±0.0218 | 0.434±0.0222 | 0.394±0.0219 | 0.424±0.0221 |
+ | piqa | acc | 0.7606±0.01 | 0.753±0.0101 | 0.772±0.0098 | 0.7617±0.0099 | 0.7775±0.0097 |
+ | | acc_norm | 0.7557±0.01 | 0.7573±0.01 | 0.7726±0.0098 | 0.7579±0.01 | 0.7753±0.0097 |
+ | pubmedqa | acc | 0.696±0.0206 | 0.612±0.0218 | 0.62±0.0217 | 0.648±0.0214 | 0.63±0.0216 |
+ | race | acc | 0.4077±0.0152 | 0.3904±0.0151 | 0.4249±0.0153 | 0.4048±0.0152 | 0.4593±0.0154 |
+ | winogrande | acc | 0.6717±0.0132 | 0.6614±0.0133 | 0.6646±0.0133 | 0.6598±0.0133 | 0.6756±0.0132 |

  ## Conclusion
  Zebra-Llama demonstrates a practical and scalable framework for composing highly efficient hybrid models from existing pre-trained Transformers. By intelligently combining MLA and Mamba2 layers, this approach drastically reduces memory requirements and improves inference throughput while preserving the strong capabilities of the original model. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.
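
The KV Size column in the architecture table above can be sanity-checked with a short back-of-the-envelope calculation. The sketch below is not taken from the repository's code; it assumes Llama-3.2-3B-Instruct's published configuration (28 layers, 8 KV heads, head dimension 128) and assumes each MLA layer caches only its compressed KV latent (r<sub>kv</sub>) plus the decoupled RoPE key (d<sub>rope</sub>) per token, while Mamba2 layers hold a fixed-size state that does not grow with sequence length. Under those assumptions the reported 4.69% and 2.01% ratios fall out directly.

```python
# Back-of-the-envelope check of the "KV Size" column (assumptions, not the repo's code):
#   * Llama-3.2-3B-Instruct: 28 layers, 8 KV heads, head_dim 128, K and V both cached.
#   * Each MLA layer caches r_kv latent elements + d_rope RoPE-key elements per token.
#   * Mamba2 layers keep a constant-size state, so they add no per-token cache.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128
R_KV, D_ROPE = 128, 64

def kv_ratio(num_mla_layers: int) -> float:
    """Per-token cache of the hybrid model relative to the original Llama."""
    baseline = N_LAYERS * 2 * N_KV_HEADS * HEAD_DIM   # K and V for every layer
    hybrid = num_mla_layers * (R_KV + D_ROPE)         # latent + RoPE key per MLA layer
    return hybrid / baseline

print(f"14 MLA + 14 Mamba2: {kv_ratio(14):.2%}")  # ~4.69%
print(f" 6 MLA + 22 Mamba2: {kv_ratio(6):.2%}")   # ~2.01%
```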
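The zero-shot numbers in the benchmark table come from the LM Evaluation Harness. As a rough guide to reproducing such a run, the sketch below uses the harness's Python API (lm-eval ≥ 0.4); the checkpoint id is a placeholder rather than the actual repository name, and options such as trust_remote_code may or may not be needed depending on how the hybrid layers are packaged.

```python
# Minimal sketch (not the authors' evaluation script): zero-shot evaluation with
# EleutherAI's lm-eval (v0.4+). The checkpoint id below is a placeholder.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<org>/<zebra-llama-checkpoint>,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu",
           "openbookqa", "piqa", "pubmedqa", "race", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task accuracies and standard errors are reported under the "results" key.
print(json.dumps(results["results"], indent=2, default=str))
```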