Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ and first released at [this page](https://github.com/facebookresearch/llama).

What does Ahma mean? Ahma is the Finnish word for wolverine! In Finnish Lapland, wolverines are the biggest cause of reindeer damage.

- There are two different sized base Ahma models
+ There are two different-sized base Ahma models, both pretrained from scratch: Ahma-3B for 139B tokens and Ahma-7B for 149B tokens:

| Model | Context length | Layers | Dim | Heads | Params |
|:------|:---------------|:-------|:----|:------|:-------|
@@ -203,40 +203,40 @@ This Ahma 3B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht

0-shot results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies | 50.77 | 48.46 | 56.92 | TBA | 49.23 | 40.00 | 54.62 |
| Arithmetic | 27.64 | 22.14 | 11.50 | TBA | 33.15 | 30.16 | 30.34 |
| Cause and Effect | 59.48 | 58.82 | 59.48 | TBA | 66.01 | 58.82 | 62.74 |
| Emotions | 36.25 | 28.12 | 36.25 | TBA | 22.50 | 26.25 | 35.63 |
| Empirical Judgements | 33.33 | 35.35 | 33.33 | TBA | 27.27 | 33.33 | 49.49 |
| General Knowledge | 44.29 | 48.57 | 51.43 | TBA | 40.00 | 24.29 | 51.43 |
| HHH Alignment | 42.09 | 41.66 | 44.23 | TBA | 41.81 | 42.51 | 42.92 |
| Intent Recognition | 24.42 | 26.16 | 43.64 | TBA | 17.49 | 22.40 | 68.35 |
| Misconceptions | 46.27 | 47.01 | 46.27 | TBA | 53.73 | 53.73 | 52.24 |
| Paraphrase | 59.50 | 73.00 | 67.00 | TBA | 51.00 | 50.00 | 51.00 |
| Sentence Ambiguity | 53.33 | 65.00 | 60.00 | TBA | 51.67 | 48.33 | 50.00 |
| Similarities Abstraction | 65.79 | 68.42 | 71.05 | TBA | 60.53 | 65.79 | 60.53 |
| **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | TBA | **46.17** | **44.42** | **52.08** |
| **Overall Average** | **36.49** | **34.06** | **29.20** | TBA | **38.93** | **36.50** | **40.00** |

3-shot results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies | 50.77 | 49.23 | 49.23 | TBA | 40.77 | 54.62 | 76.92 |
| Arithmetic | 38.38 | 43.89 | 20.88 | TBA | 43.63 | 45.78 | 53.68 |
| Cause and Effect | 60.78 | 64.71 | 66.01 | TBA | 64.05 | 58.17 | 67.32 |
| Emotions | 30.00 | 41.25 | 30.00 | TBA | 44.37 | 48.13 | 56.87 |
| Empirical Judgements | 46.46 | 44.44 | 39.39 | TBA | 32.32 | 43.43 | 63.64 |
| General Knowledge | 47.14 | 40.00 | 27.14 | TBA | 54.29 | 28.57 | 74.29 |
| HHH Alignment | 43.53 | 44.80 | 43.80 | TBA | 45.39 | 44.80 | 46.07 |
| Intent Recognition | 20.52 | 44.22 | 36.42 | TBA | 51.45 | 58.82 | 83.67 |
| Misconceptions | 50.75 | 52.24 | 46.27 | TBA | 52.99 | 46.27 | 52.99 |
| Paraphrase | 50.50 | 58.50 | 57.50 | TBA | 53.00 | 54.50 | 55.00 |
| Sentence Ambiguity | 53.33 | 48.33 | 53.33 | TBA | 51.67 | 53.33 | 66.67 |
| Similarities Abstraction | 69.74 | 72.37 | 72.37 | TBA | 64.47 | 73.68 | 75.00 |
| **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | TBA | **51.19** | **50.94** | **61.96** |
| **Overall Average** | **42.87** | **47.27** | **33.41** | TBA | **46.99** | **48.07** | **57.36** |

As we can see, the Ahma 3B base model outperforms models twice its size, such as FinGPT 8B and Viking 7B, especially on non-arithmetic tasks in 0-shot usage. Even the 10X larger Poro 34B model, which is generally better, does not show a huge performance difference considering its size, and Ahma 3B actually surpasses it on some tasks. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
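
As a rough illustration of how few-shot scores like these are produced, the sketch below runs a 3-shot evaluation through the EleutherAI `lm-evaluation-harness` Python API. It is not the exact setup behind the numbers above: the task name `fin_bench_analogies` and the repo id `Finnish-NLP/Ahma-3B` are placeholders, and the reported FIN-bench results come from TurkuNLP's evaluation setup, which may be configured differently.

```python
# Illustrative sketch only: a 3-shot run with lm-evaluation-harness.
# The task name and repo id are placeholders, not the configuration used
# for the FIN-bench numbers reported above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Finnish-NLP/Ahma-3B,dtype=bfloat16",
    tasks=["fin_bench_analogies"],  # hypothetical task name
    num_fewshot=3,  # 3-shot, as in the second table above
)
print(results["results"])
```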
@@ -252,29 +252,29 @@ Single-turn results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
| Coding | 1.00 | 1.00 | 1.70 | TBA |
| Extraction | 2.00 | 1.30 | 3.10 | TBA |
| Humanities | 4.05 | 6.20 | 6.60 | TBA |
| Math | 3.00 | 3.20 | 3.90 | TBA |
| Reasoning | 2.90 | 4.60 | 3.70 | TBA |
| Roleplay | 4.80 | 6.50 | 6.60 | TBA |
| STEM | 5.10 | 5.95 | 6.75 | TBA |
| Writing | 6.60 | 9.00 | 7.10 | TBA |
| **Overall Average** | **3.68** | **4.72** | **4.93** | TBA |

Multi-turn results:

| Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
|:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
| Coding | 1.00 | 1.00 | 1.40 | TBA | 3.70 |
| Extraction | 1.55 | 1.15 | 2.05 | TBA | 6.37 |
| Humanities | 3.25 | 6.20 | 4.95 | TBA | 9.25 |
| Math | 2.20 | 2.70 | 2.50 | TBA | 1.20 |
| Reasoning | 2.45 | 3.50 | 2.55 | TBA | 4.35 |
| Roleplay | 4.90 | 6.40 | 6.35 | TBA | 7.35 |
| STEM | 4.20 | 4.78 | 4.28 | TBA | 7.80 |
| Writing | 3.80 | 6.65 | 4.10 | TBA | 8.50 |
| **Overall Average** | **2.92** | **4.05** | **3.52** | TBA | **6.06** |

As we can see, the Ahma 3B base model struggles with multi-turn examples, as expected, since it has only been pretrained with single-turn instruction-following examples. Coding performance was also expectedly poor because the Ahma 3B model was not trained on code data. In addition, Ahma 3B tended to repeat the generated text endlessly in some evaluation examples, which affected the scoring. Adding a repetition penalty to the generation settings of the evaluation script already improved the scores significantly, so in real-world use the Ahma 3B model should be run with better generation settings than those used in this benchmark.
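
To illustrate that last point, here is a minimal generation sketch with the `transformers` library that applies a repetition penalty. It is not the exact evaluation setup used above: the repo id `Finnish-NLP/Ahma-3B`, the prompt, and the penalty value are assumptions for illustration, and in practice the prompt should be wrapped in the instruct prompt format documented for the model.

```python
# Minimal sketch: load an Ahma checkpoint and generate with a repetition penalty.
# Assumptions: the model is hosted as "Finnish-NLP/Ahma-3B" on the Hugging Face Hub,
# and repetition_penalty=1.2 is an example value, not the one used in the benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-3B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Plain prompt for illustration ("Briefly explain what a wolverine is.");
# in practice, wrap it in the model's instruct prompt format.
prompt = "Kerro lyhyesti, mikä on ahma."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,  # discourages the repetition issue described above
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```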