---
license: apache-2.0
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: Fewshot-Metamath-OrcaVicuna-Mistral-10B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 56.4
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 78.12
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 59.52
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 50.98
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 76.48
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 13.27
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
---
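The 10B model is produced by replicating decoder layers of the 32-layer base model according to the layer map below. Each `[start, end)` entry is a half-open range of base layers, so the three ranges stack to 16 + 16 + 16 = 48 layers in the expanded model: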
```json
{
"layer_map": [
[0, 16],
[8, 24],
[16, 32]
]
}
```
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)
This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral)
that builds on the idea of scaling up models by duplicating layers of the base model, in this case
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). It relies on the functionality added in
[huggingface/peft#1368](https://github.com/huggingface/peft/pull/1368) to train a model with replicated layers without much extra GPU memory:
although LoRA adapters are attached to all 48 layers, only the 32 original layers are materialized, so memory usage is roughly the same as for the base 7B model.
This is a demonstration model intended to show how the approach works; the goal is to apply it to much larger models. For example,
Goliath and MegaDolphin are effectively 120B models, but with this approach they would need only the memory of their 70B base model,
plus a little extra for the LoRA adapter layers.
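As a minimal sketch of that setup (assuming a `peft` release that includes the `layer_replication` option added by that PR; the choice of `target_modules` here is an assumption, not the card's actual training config):
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the 32-layer base model once; replicated layers share its weights.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=8,
    # Assumed set of adapted projections; adjust to taste.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Same layer map as above: three half-open [start, end) ranges of
    # base layers, stacked to form a 48-layer model.
    layer_replication=[(0, 16), (8, 24), (16, 32)],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapters on 48 layers, base weights for 32
```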
In our training runs we did find a difference in the behavior of the eval loss:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/vszXUSmANBw6EFjn4sX1N.png)
vs the loss curve for the original LoRA finetune of the 7B model
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f95cac5f9ba52bbcd7f/dis1P2MD_Rsyw81aIVByS.png)
The larger model reached a best eval loss of 0.3915, versus 0.3971 for the 7B model, in far fewer steps.
Overall, we think this is a promising approach to accessing much larger models without significantly more resources.
# Performance on Metrics
To do a proper ablation, we compared the performance of four models trained for ~1 epoch on the combined datasets (MetaMath,
Orca, ShareGPT). Here are the results:
| Model | Trainable Params | Train Loss | Eval Loss | GSM8K | TruthfulQA |
| :-----| ------: | ---------: | -------: | ----: | ---------: |
| Mistral 7B | 0 | - | - | 0.374 | 0.426 |
| Mistral 10B | 0 | - | - | 0.290 | 0.407 |
| Mistral 7B + LoRA r=12 | 31M | 0.412 | 0.366 | 0.514 | 0.499 |
| Mistral 10B + LoRA r=8 | 31M | 0.401 | 0.363 | 0.663 | 0.540 |
This ablation compares the base model (Mistral 7B), its expansion using the layer map described above, and LoRA fine-tunes of the
base model (`r=12`) and of the expanded model (`r=8`, chosen to match trainable parameter counts; see the sketch below). It demonstrates quite clearly that
fine-tuning the expanded model yields a significant improvement in metrics even with the same number of trainable parameters (and training steps).
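As a back-of-the-envelope check on the parameter match (assuming LoRA on all seven linear projections of each decoder layer with Mistral-7B's dimensions; this module set is an assumption, but it reproduces the ~31M figure in the table):
```python
# LoRA adds r * (d_in + d_out) parameters per adapted linear layer.
# Mistral-7B dims: hidden 4096, grouped-query KV dim 1024, MLP dim 14336.
HIDDEN, KV, MLP = 4096, 1024, 14336
SHAPES = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

def lora_params(r: int, n_layers: int) -> int:
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in SHAPES)

print(lora_params(12, 32))  # 7B,  r=12: 31,457,280 (~31M)
print(lora_params(8, 48))   # 10B, r=8:  31,457,280 (~31M)
```
Since 32 x 12 = 48 x 8, the two runs adapt exactly the same number of parameters; only how they are spread across layers differs.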
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_abacusai__Fewshot-Metamath-OrcaVicuna-Mistral-10B)
| Metric |Value|
|---------------------------------|----:|
|Avg. |55.79|
|AI2 Reasoning Challenge (25-Shot)|56.40|
|HellaSwag (10-Shot) |78.12|
|MMLU (5-Shot) |59.52|
|TruthfulQA (0-shot) |50.98|
|Winogrande (5-shot) |76.48|
|GSM8k (5-shot) |13.27|