fblgit committed on
Commit f322e92
Parent: 3e12f69

Updated Scores

Files changed (1)
  1. README.md +73 -49
README.md CHANGED
@@ -21,7 +21,8 @@ model-index:
  split: validation
  metrics:
  - type: accuracy
- value: 65.49
  - task:
  type: text-generation
  name: ARC-Challenge
@@ -32,7 +33,8 @@ model-index:
  split: test
  metrics:
  - type: accuracy
- value: 68.09
  - task:
  type: text-generation
  name: HellaSwag
@@ -42,18 +44,8 @@ model-index:
  split: test
  metrics:
  - type: accuracy
- value: 85.20
- - task:
- type: text-generation
- name: GSM8k
- dataset:
- type: text-generation
- name: gsm8k
- config: main
- split: test
- metrics:
- - type: accuracy
- value: 48.98
  - task:
  type: text-generation
  name: Winogrande
@@ -64,7 +56,8 @@ model-index:
  split: test
  metrics:
  - type: accuracy
- value: 76.8
  - task:
  type: text-generation
  name: MMLU
@@ -75,7 +68,8 @@ model-index:
  split: test
  metrics:
  - type: accuracy
- value: 61.37
  - task:
  type: text-generation
  name: PiQA
@@ -95,7 +89,8 @@ model-index:
  split: validation
  metrics:
  - type: accuracy
- value: 49.8
  - task:
  type: text-generation
  name: PubMedQA
@@ -109,28 +104,23 @@ model-index:
  value: 76.0
  ---

- # juanako-7b-UNA-v2

  This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
  In many respects it outperforms most current Mistral-based models and is the **latest and most powerful juanako version as of now**.

- ## Scoring and records (26-November-2023)
- Here are some results:
- * Scores #1 7B Model
- * Scores #4 GSM8k
- * Scores #2 in TruthfulQA
- * Scores #6 in CoPa
- * Scores #2 in PiQA
- * Scores #9 in BoolQ
  | Model | Average ⬆️| ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️| TruthfulQA (MC) (0-s) ⬆️ | Winogrande (5-s) | GSM8K (5-s) | DROP (3-s) |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  |[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
  | [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | 59.0 | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |
- | [fblgit/juanako-7b-UNA](https://huggingface.co/fblgit/juanako-7b-UNA) | **65.10** | **68.09** | **85.20** | 61.37 | **65.49** | 76.8 | **48.98** | **49.8** |

- Many evaluations were performed, but it behaves very balanced in multiple fields. Feel free to submit more evaluation results.
-
- It scores: **65.1** according HuggingFace LLM Leaderboard.

  Author [Xavier M.](mailto:xavi@juanako.ai) @fblgit

@@ -138,33 +128,68 @@ Author [Xavier M.](mailto:xavi@juanako.ai) @fblgit

  juanako uses UNA (Uniform Neural Alignment), a training technique that eases alignment between transformer layers and is yet to be published.

- ## TruthfulQA 0-Shot
  ```
  | Tasks |Version|Filter|Metric|Value | |Stderr|
  |--------------|-------|------|------|-----:|---|-----:|
  |truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
  ```
- ## ARC 25-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |-------------|-------|------|--------|-----:|---|-----:|
  |arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
  | | |none |acc_norm|0.6809|± |0.0136|
  ```
- ## HellaSwag 10-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |---------|-------|------|--------|-----:|---|-----:|
  |hellaswag|Yaml |none |acc |0.6703|± |0.0047|
  | | |none |acc_norm|0.8520|± |0.0035|
  ```
- ## GSM8k 5-Shot
  ```
  |Tasks|Version| Filter | Metric |Value | |Stderr|
  |-----|-------|----------|-----------|-----:|---|-----:|
  |gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
  ```
- ## GPT Evaluations 0-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |--------------|-------|------|----------|-----:|---|-----:|
@@ -176,39 +201,39 @@ juanako uses UNA, Uniform Neural Alignment. A training technique that ease align
  |sciq |Yaml |none |acc |0.9580|± |0.0063|
  | | |none |acc_norm |0.9130|± |0.0089|
  ```
- ## MathQA 0-Shot
  ```
  |Tasks |Version|Filter| Metric |Value | |Stderr|
  |------|-------|------|--------|-----:|---|-----:|
  |mathqa|Yaml |none |acc |0.3752|± |0.0089|
  | | |none |acc_norm|0.3772|± |0.0089|
  ```
- ## PiQa 1-Shot
  ```
  |Tasks|Version|Filter| Metric |Value | |Stderr|
  |-----|-------|------|--------|-----:|---|-----:|
  |piqa |Yaml |none |acc |0.8308|± |0.0087|
  | | |none |acc_norm|0.8357|± |0.0086|
  ```
- ## Winogrande 5-Shot
  ```
  | Tasks |Version|Filter|Metric|Value| |Stderr|
  |----------|-------|------|------|----:|---|-----:|
  |winogrande|Yaml |none |acc |0.768|± |0.0119|
  ```
- ## PubMedQA 0-Shot
  ```
  | Tasks |Version|Filter|Metric|Value| |Stderr|
  |--------|-------|------|------|----:|---|-----:|
  |pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
  ```
- ## RACE 1-Shot
  ```
  |Tasks|Version|Filter|Metric|Value | |Stderr|
  |-----|-------|------|------|-----:|---|-----:|
  |race |Yaml |none |acc |0.5282|± |0.0154|
  ```
- ## MMLU 5-Shot (8-Bit)
  ```
  | Groups |Version|Filter|Metric|Value | |Stderr|
  |------------------|-------|------|------|-----:|---|-----:|
@@ -218,19 +243,22 @@ juanako uses UNA, Uniform Neural Alignment. A training technique that ease align
  | - social_sciences|N/A |none |acc |0.7195|± |0.0713|
  | - stem |N/A |none |acc |0.5087|± |0.1297|
  ```
- ## DROP 3-Shot (8-Bit) (Instruct-Eval)
  ```
  {'score': 0.49801113762927607}
  {'drop': 49.8}
  drop: 49.8
  ```

- ## CRASS 0-Shot (Instruct-Eval)
  ```
  {'score': 0.8357664233576643}
  {'crass': 83.58}
  crass: 83.58
  ```
  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -267,6 +295,7 @@ The following hyperparameters were used during training:

  ## Citations
  If you find juanako useful, please cite:
  ```
  @misc{juanako7buna,
  title={Juanako: Uniform Neural Alignment},
@@ -278,6 +307,7 @@ If you find juanako useful please:
  }
  ```

  ```
  @misc{lin2021truthfulqa,
  title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
@@ -295,12 +325,6 @@ If you find juanako useful please:
  archivePrefix={arXiv},
  primaryClass={cs.LG}
  }
- @article{cobbe2021gsm8k,
- title={Training Verifiers to Solve Math Word Problems},
- author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
- journal={arXiv preprint arXiv:2110.14168},
- year={2021}
- }
  @inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and
  Ronan Le Bras and Jianfeng Gao
 
  split: validation
  metrics:
  - type: accuracy
+ value: 65.13
+ verified: true
  - task:
  type: text-generation
  name: ARC-Challenge
 
  split: test
  metrics:
  - type: accuracy
+ value: 68.17
+ verified: true
  - task:
  type: text-generation
  name: HellaSwag
 
  split: test
  metrics:
  - type: accuracy
+ value: 85.34
+ verified: true
  - task:
  type: text-generation
  name: Winogrande
 
  split: test
  metrics:
  - type: accuracy
+ value: 78.85
+ verified: true
  - task:
  type: text-generation
  name: MMLU
 
  split: test
  metrics:
  - type: accuracy
+ value: 62.47
+ verified: true
  - task:
  type: text-generation
  name: PiQA
 
  split: validation
  metrics:
  - type: accuracy
+ value: 38.74
+ verified: true
  - task:
  type: text-generation
  name: PubMedQA
 
  value: 76.0
  ---

+ # juanako-7b-UNA (Uniform Neural Alignment)

  This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
  In many respects it outperforms most current Mistral-based models and is the **latest and most powerful juanako version as of now**.

+ ## Scores
+
+ The official HuggingFace results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/fblgit/juanako-7b-UNA/results_2023-11-28T08-33-33.965228.json).
+
  | Model | Average ⬆️| ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️| TruthfulQA (MC) (0-s) ⬆️ | Winogrande (5-s) | GSM8K (5-s) | DROP (3-s) |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  |[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
  | [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | 59.0 | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |
+ | [fblgit/juanako-7b-UNA](https://huggingface.co/fblgit/juanako-7b-UNA) | **59.91** | **68.17** | **85.34** | 62.47 | **65.13** | **78.85** | **20.7** | 38.74 |

+ It scores **59.91** according to the HuggingFace LLM Leaderboard.
+ It scores **65.1** with the `big-refactor` branch of lm-eval-harness.

  Author [Xavier M.](mailto:xavi@juanako.ai) @fblgit

  juanako uses UNA (Uniform Neural Alignment), a training technique that eases alignment between transformer layers and is yet to be published.

+ ### Prompts
+ The following prompts showed positive results; performance may depend on the task and needs further experimentation, but these should work for starters:
+ ```
+ <|im_start|>system
+ - You are a helpful assistant chatbot trained by MosaicML.
+ - You answer questions.
+ - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
+ - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.<|im_end|>
+ <|im_start|>user
+ Explain QKV<|im_end|>
+ <|im_start|>assistant
+ ```
+ ```
+ ### Assistant: I am StableVicuna, a large language model created by CarperAI. I am here to chat!
+
+ ### Human: Explain QKV
+ ### Assistant:
+ ```
+ ```
+ [Round <|round|>]
+ 问:Explain QKV
+ 答:
+ ```
+ ```
+ [Round <|round|>]
+ Question:Explain QKV
+ Answer:
+ ```
+ ```
+ Question:Explain QKV
+ Answer:
+ ```
+
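For quick testing, here is a minimal inference sketch with 🤗 Transformers that wraps a question in the ChatML-style format shown above; the generation settings and hardware assumptions (a GPU and `accelerate` installed for `device_map="auto"`) are illustrative, not part of the card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fblgit/juanako-7b-UNA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" keeps the checkpoint dtype; device_map="auto" requires accelerate.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# ChatML-style prompt, following the first format above (system text shortened here).
prompt = (
    "<|im_start|>system\n"
    "- You are a helpful assistant chatbot.\n"
    "- You answer questions.<|im_end|>\n"
    "<|im_start|>user\n"
    "Explain QKV<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```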
+ ## Evaluations (lm-eval big-refactor branch)
+
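The per-task tables below come from lm-evaluation-harness. As a rough sketch of how a single task could be re-run (assuming the `big-refactor` / v0.4-style Python API; argument names and defaults may differ between revisions, and the dtype and batch size here are illustrative):

```python
from lm_eval import evaluator  # lm-evaluation-harness, big-refactor / v0.4-style API

# Evaluate TruthfulQA (mc2) 0-shot on the model.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=fblgit/juanako-7b-UNA,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["truthfulqa_mc2"])
```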
+ ### TruthfulQA 0-Shot
  ```
  | Tasks |Version|Filter|Metric|Value | |Stderr|
  |--------------|-------|------|------|-----:|---|-----:|
  |truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
  ```
+ ### ARC 25-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |-------------|-------|------|--------|-----:|---|-----:|
  |arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
  | | |none |acc_norm|0.6809|± |0.0136|
  ```
+ ### HellaSwag 10-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |---------|-------|------|--------|-----:|---|-----:|
  |hellaswag|Yaml |none |acc |0.6703|± |0.0047|
  | | |none |acc_norm|0.8520|± |0.0035|
  ```
+ ### GSM8k 5-Shot
  ```
  |Tasks|Version| Filter | Metric |Value | |Stderr|
  |-----|-------|----------|-----------|-----:|---|-----:|
  |gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
  ```
+ ### GPT Evaluations 0-Shot
  ```
  | Tasks |Version|Filter| Metric |Value | |Stderr|
  |--------------|-------|------|----------|-----:|---|-----:|
 
  |sciq |Yaml |none |acc |0.9580|± |0.0063|
  | | |none |acc_norm |0.9130|± |0.0089|
  ```
+ ### MathQA 0-Shot
  ```
  |Tasks |Version|Filter| Metric |Value | |Stderr|
  |------|-------|------|--------|-----:|---|-----:|
  |mathqa|Yaml |none |acc |0.3752|± |0.0089|
  | | |none |acc_norm|0.3772|± |0.0089|
  ```
+ ### PiQa 1-Shot
  ```
  |Tasks|Version|Filter| Metric |Value | |Stderr|
  |-----|-------|------|--------|-----:|---|-----:|
  |piqa |Yaml |none |acc |0.8308|± |0.0087|
  | | |none |acc_norm|0.8357|± |0.0086|
  ```
+ ### Winogrande 5-Shot
  ```
  | Tasks |Version|Filter|Metric|Value| |Stderr|
  |----------|-------|------|------|----:|---|-----:|
  |winogrande|Yaml |none |acc |0.768|± |0.0119|
  ```
+ ### PubMedQA 0-Shot
  ```
  | Tasks |Version|Filter|Metric|Value| |Stderr|
  |--------|-------|------|------|----:|---|-----:|
  |pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
  ```
+ ### RACE 1-Shot
  ```
  |Tasks|Version|Filter|Metric|Value | |Stderr|
  |-----|-------|------|------|-----:|---|-----:|
  |race |Yaml |none |acc |0.5282|± |0.0154|
  ```
+ ### MMLU 5-Shot (8-Bit)
  ```
  | Groups |Version|Filter|Metric|Value | |Stderr|
  |------------------|-------|------|------|-----:|---|-----:|
 
  | - social_sciences|N/A |none |acc |0.7195|± |0.0713|
  | - stem |N/A |none |acc |0.5087|± |0.1297|
  ```
+ ### DROP 3-Shot (8-Bit) (Instruct-Eval)
  ```
  {'score': 0.49801113762927607}
  {'drop': 49.8}
  drop: 49.8
  ```

+ ### CRASS 0-Shot (Instruct-Eval)
  ```
  {'score': 0.8357664233576643}
  {'crass': 83.58}
  crass: 83.58
  ```
+
+ ## Training Details
+
  ### Training hyperparameters

  The following hyperparameters were used during training:
 

  ## Citations
  If you find juanako useful, please cite:
+
  ```
  @misc{juanako7buna,
  title={Juanako: Uniform Neural Alignment},
 
  }
  ```

+ Thanks to all the brilliant humans behind the creation of AI; here are some of the works we find relevant to our research. If you feel a citation is missing, please contact us.
  ```
  @misc{lin2021truthfulqa,
  title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},

  archivePrefix={arXiv},
  primaryClass={cs.LG}
  }
  @inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and
  Ronan Le Bras and Jianfeng Gao