matsuo-lab committed
Commit 8ff1eaf
Parent(s): 043ea3b
Update README.md

README.md CHANGED
@@ -37,7 +37,20 @@ This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 bi
 
 # Benchmarking
 
-* **Japanese benchmark**
+* **Japanese benchmark : JGLUE 8-task (2023-08-27)**
+
+  - *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
+  - *The 8-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket_v2-0.2, xlsum_ja-1.0, xwinograd_ja, and mgsm-1.0.*
+  - *The model is loaded in float16, and evaluation is performed with template version 0.3 using few-shot in-context learning.*
+  - *The numbers of few-shot examples are 3, 3, 3, 2, 1, 1, 0, and 5, respectively.*
+  - *special_tokens_map.json was modified to avoid errors during evaluation of the second half of the benchmarks; as a result, the results of the first half differ slightly from the earlier run.*
+
+  | model | average | jcommonsenseqa | jnli | marc_ja | jsquad | jaqket_v2 | xlsum_ja | xwinograd_ja | mgsm |
+  | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
+  | weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
+  | weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
+
+* **Japanese benchmark : JGLUE 4-task (2023-08-18)**
 
   - *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
   - *The 4-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
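
As a sanity check, the 8-task "average" column in the added table is the unweighted mean of the per-task scores. The sketch below reproduces it; the pairing of the few-shot counts (3, 3, 3, 2, 1, 1, 0, 5) with the tasks is an assumption based on the order in which the README lists them.

```python
# Sanity-check the JGLUE 8-task averages from the benchmark table.
# Assumption: the few-shot counts follow the task order given in the README
# (jcommonsenseqa ... mgsm); the README gives them only as a bare sequence.

tasks = ["jcommonsenseqa", "jnli", "marc_ja", "jsquad",
         "jaqket_v2", "xlsum_ja", "xwinograd_ja", "mgsm"]
few_shots = dict(zip(tasks, [3, 3, 3, 2, 1, 1, 0, 5]))

scores = {
    "weblab-10b-instruction-sft": [74.62, 66.56, 95.49, 78.34, 63.32, 20.57, 71.95, 2.0],
    "weblab-10b":                 [66.58, 53.74, 82.07, 62.94, 56.19, 10.03, 71.95, 2.4],
}

for model, vals in scores.items():
    avg = sum(vals) / len(vals)  # unweighted mean over the 8 tasks
    print(f"{model}: {avg:.2f}")  # matches the reported 'average' column
```

Note the large spread across tasks (e.g. mgsm near zero vs. marc_ja above 80), which the single average hides.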