matsuo-lab committed on
Commit
8ff1eaf
1 Parent(s): 043ea3b

Update README.md

  1. README.md +14 -1
README.md CHANGED
@@ -37,7 +37,20 @@ This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 bi
 
  # Benchmarking
 
- * **Japanese benchmark**
 
  - *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
  - *The 4-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
 
 
  # Benchmarking
 
+ * **Japanese benchmark : JGLUE 8-task (2023-08-27)**
+
+ - *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
+ - *The 8-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket_v2-0.2, xlsum_ja-1.0, xwinograd_ja, and mgsm-1.0.*
+ - *Models are loaded in float16, and evaluation uses template version 0.3 with few-shot in-context learning.*
+ - *The number of few-shot examples per task is 3, 3, 3, 2, 1, 1, 0, 5.*
+ - *special_tokens_map.json was modified to avoid errors when evaluating the second half of the benchmarks. As a result, the results for the first half of the benchmarks changed slightly.*
+
+ | model | average | jcommonsenseqa | jnli | marc_ja | jsquad | jaqket_v2 | xlsum_ja | xwinograd_ja | mgsm |
+ | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
+ | weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
+ | weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
+
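As a quick sanity check on the 8-task table above, the following minimal sketch recomputes the `average` column as an unweighted mean of the eight per-task scores, and pairs each task with its few-shot count. The names `TASKS`, `NUM_FEWSHOT`, `SCORES`, and `task_average` are illustrative and not part of the repository or the evaluation harness. Matching the few-shot counts to tasks by position is an assumption, because the README lists the counts without naming the tasks.

```python
# Task names in the column order of the 8-task table.
TASKS = [
    "jcommonsenseqa", "jnli", "marc_ja", "jsquad",
    "jaqket_v2", "xlsum_ja", "xwinograd_ja", "mgsm",
]

# Few-shot counts as listed in the README. Pairing them with TASKS by
# position is an assumption; the README does not state the order.
NUM_FEWSHOT = [3, 3, 3, 2, 1, 1, 0, 5]

# Per-task scores copied from the table.
SCORES = {
    "weblab-10b-instruction-sft": [74.62, 66.56, 95.49, 78.34, 63.32, 20.57, 71.95, 2.0],
    "weblab-10b": [66.58, 53.74, 82.07, 62.94, 56.19, 10.03, 71.95, 2.4],
}


def task_average(scores):
    """Return the unweighted mean of the scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


shots_by_task = dict(zip(TASKS, NUM_FEWSHOT))
averages = {model: task_average(scores) for model, scores in SCORES.items()}

if __name__ == "__main__":
    for model, avg in averages.items():
        print(f"{model}: {avg}")
```

Under these assumptions the recomputed means agree with the reported averages of 59.11 and 50.74, which suggests the `average` column is a plain mean over the eight task scores.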
+ * **Japanese benchmark : JGLUE 4-task (2023-08-18)**
 
  - *We used the [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
  - *The 4-task average accuracy is based on the results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*