Update README.md
README.md CHANGED

@@ -1,5 +1,6 @@

---
base_model:
- meta-llama/Llama-3.2-3B-Instruct
datasets:
- JunxiongWang/sftdatasetv3
model-index:

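The new `base_model` field records the Transformer checkpoint this hybrid was initialized from. For orientation, here is a minimal loading sketch; the repo id is a placeholder and `trust_remote_code=True` is an assumption about how the custom MLA/Mamba2 layers are shipped, so the usage example earlier in the README remains the authoritative reference.

```python
# Minimal loading sketch. The repo id is a placeholder, not the published checkpoint name,
# and trust_remote_code=True assumes the checkpoint ships custom hybrid MLA/Mamba2 modeling code.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/Zebra-Llama-3B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```
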
@@ -89,31 +90,31 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=False))
| Model | KV Size | Param | Index of MLA layers | r<sub>kv</sub> | r<sub>q</sub> | d<sub>rope</sub> | d<sub>nope</sub> |
|-------|--------:|------:|--------------------:|------:|------:|---------:|---------:|
| Llama-3.2-3B-Instruct | 100% | 3.21B | - | - | - | - | - |
| Zebra-Llama-3B-6MLA-22Mamba | 2.01% | 3.39B | [0,5,10,16,21,26] | 128 | 1536 | 64 | 64 |
| Zebra-Llama-3B-14MLA-14Mamba | 4.69% | 3.27B | [0,2,4,6,8,10,12,14,16,18,20,22,24,26] | 128 | 1536 | 64 | 64 |
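
The KV Size column follows from simple per-token bookkeeping. The sketch below is a back-of-the-envelope check, assuming Llama-3.2-3B's standard attention geometry (28 layers, 8 KV heads, head dimension 128) and that each MLA layer caches a compressed latent of width r<sub>kv</sub> plus a decoupled RoPE key of width d<sub>rope</sub> per token, while Mamba2 layers keep only a fixed-size recurrent state.

```python
# Rough sketch of where the "KV Size" column comes from. Assumes Llama-3.2-3B's
# standard attention geometry (28 layers, 8 KV heads, head_dim 128), that each
# MLA layer caches a compressed KV latent of width r_kv plus a decoupled RoPE
# key of width d_rope per token, and that Mamba2 layers add no per-token cache.

N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128   # Llama-3.2-3B (assumed)
R_KV, D_ROPE = 128, 64                        # from the table above

full_kv_per_token = N_LAYERS * 2 * N_KV_HEADS * HEAD_DIM   # keys + values, all layers

def zebra_kv_fraction(n_mla_layers: int) -> float:
    """Fraction of the full Llama KV cache retained by a Zebra-Llama variant."""
    return n_mla_layers * (R_KV + D_ROPE) / full_kv_per_token

print(f"6 MLA + 22 Mamba2 : {zebra_kv_fraction(6):.2%}")    # ~2.01%
print(f"14 MLA + 14 Mamba2: {zebra_kv_fraction(14):.2%}")   # ~4.69%
```

Under those assumptions, the two hybrids land on the 2.01% and 4.69% figures reported in the table.
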
### Benchmark results

Zebra-Llama was evaluated zero-shot on the Language Model Evaluation Harness and compared against its base model and other post-training methods. The results demonstrate that Zebra-Llama provides a superior balance of performance and efficiency.

| Tasks | Metric | Llama-3.2-3B-Instruct | Zebra-Llama-3B-6MLA-22M2-SFT | Zebra-Llama-3B-6MLA-22M2-DPO | Zebra-Llama-3B-14MLA-14M2-SFT | Zebra-Llama-3B-14MLA-14M2-DPO |
|-------------------|----------|----------------------:|-----:|-----:|-----:|-----:|
| arc_challenge | acc | 0.4369±0.0145 | 0.4189±0.0144 | 0.4744±0.0146 | 0.4326±0.0145 | 0.4966±0.0146 |
| | acc_norm | 0.459±0.0146 | 0.4539±0.0145 | 0.5077±0.0146 | 0.4667±0.0146 | 0.5128±0.0146 |
| arc_easy | acc | 0.7428±0.009 | 0.7677±0.0087 | 0.7984±0.0082 | 0.771±0.0086 | 0.7959±0.0083 |
| | acc_norm | 0.6776±0.0096 | 0.7205±0.0092 | 0.7609±0.0088 | 0.7269±0.0091 | 0.7614±0.0087 |
| hellaswag | acc | 0.5222±0.005 | 0.5014±0.005 | 0.5213±0.005 | 0.5068±0.005 | 0.531±0.005 |
| | acc_norm | 0.7036±0.0046 | 0.6812±0.0047 | 0.7146±0.0045 | 0.6875±0.0046 | 0.7257±0.0045 |
| mmlu | acc | 0.6046±0.1057 | 0.5049±0.103 | 0.5006±0.1038 | 0.528±0.1069 | 0.521±0.1057 |
| - humanities | acc | 0.5926±0.0826 | 0.4553±0.1006 | 0.4493±0.0948 | 0.4759±0.1027 | 0.4721±0.1008 |
| - other | acc | 0.6598±0.1118 | 0.5626±0.0935 | 0.5603±0.0971 | 0.5919±0.0948 | 0.5855±0.0916 |
| - social_sciences | acc | 0.6701±0.0712 | 0.5899±0.083 | 0.584±0.0871 | 0.6113±0.0848 | 0.6081±0.081 |
| - stem | acc | 0.5043±0.1122 | 0.4393±0.0894 | 0.437±0.0959 | 0.4615±0.1003 | 0.4456±0.0987 |
| openbookqa | acc | 0.274±0.02 | 0.278±0.0201 | 0.324±0.021 | 0.282±0.0201 | 0.33±0.021 |
| | acc_norm | 0.362±0.0215 | 0.386±0.0218 | 0.434±0.0222 | 0.394±0.0219 | 0.424±0.0221 |
| piqa | acc | 0.7606±0.01 | 0.753±0.0101 | 0.772±0.0098 | 0.7617±0.0099 | 0.7775±0.0097 |
| | acc_norm | 0.7557±0.01 | 0.7573±0.01 | 0.7726±0.0098 | 0.7579±0.01 | 0.7753±0.0097 |
| pubmedqa | acc | 0.696±0.0206 | 0.612±0.0218 | 0.62±0.0217 | 0.648±0.0214 | 0.63±0.0216 |
| race | acc | 0.4077±0.0152 | 0.3904±0.0151 | 0.4249±0.0153 | 0.4048±0.0152 | 0.4593±0.0154 |
| winogrande | acc | 0.6717±0.0132 | 0.6614±0.0133 | 0.6646±0.0133 | 0.6598±0.0133 | 0.6756±0.0132 |
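
For readers who want to rerun these numbers, the sketch below uses the harness's v0.4-style Python API. The checkpoint path, batch size, and other settings are assumptions, not necessarily the exact configuration behind the table.

```python
# Zero-shot evaluation sketch with EleutherAI's lm-evaluation-harness (v0.4-style API).
# The pretrained path is a placeholder; batch size and other settings are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/Zebra-Llama-3B,trust_remote_code=True",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu", "openbookqa",
           "piqa", "pubmedqa", "race", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metrics (acc / acc_norm) reported by the harness.
for task, metrics in results["results"].items():
    print(task, metrics)
```
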
## Conclusion
Zebra-Llama demonstrates a practical and scalable framework for composing highly efficient hybrid models from existing pre-trained Transformers. By intelligently combining MLA and Mamba2 layers, this approach drastically reduces memory requirements and improves inference throughput while preserving the strong capabilities of the original model. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.