matsuo-lab/weblab-10b

matsuo-lab committed on Aug 27, 2023

Commit 8ff1eaf (1 parent: 043ea3b)

Update README.md

Files changed (1)

README.md +14 -1

@@ -37,7 +37,20 @@ This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 bi
 # Benchmarking
-* **Japanese benchmark**
+* **Japanese benchmark : JGLUE 8-task (2023-08-27)**
+    - *We used [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
+    - *The 8-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, JSQuAD-1.1, jaqket_v2-0.2, xlsum_ja-1.0, xwinograd_ja, and mgsm-1.0.*
+    - *The model is loaded in float16, and evaluation uses prompt template version 0.3 with few-shot in-context learning.*
+    - *The numbers of few-shot examples for the eight tasks are 3, 3, 3, 2, 1, 1, 0, and 5.*
+    - *`special_tokens_map.json` was modified to avoid errors when evaluating the second half of the benchmarks. As a result, the scores for the first half of the benchmarks changed slightly.*
+    | model | average | jcommonsenseqa | jnli | marc_ja | jsquad | jaqket_v2 | xlsum_ja | xwinograd_ja | mgsm |
+    | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
+    | weblab-10b-instruction-sft | 59.11 | 74.62 | 66.56 | 95.49 | 78.34 | 63.32 | 20.57 | 71.95 | 2 |
+    | weblab-10b | 50.74 | 66.58 | 53.74 | 82.07 | 62.94 | 56.19 | 10.03 | 71.95 | 2.4 |
+* **Japanese benchmark : JGLUE 4-task (2023-08-18)**
     - *We used [Stability-AI/lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness/tree/2f1583c0735eacdfdfa5b7d656074b69577b6774) library for evaluation.*
     - *The 4-task average accuracy is based on results of JCommonsenseQA-1.1, JNLI-1.1, MARC-ja-1.1, and JSQuAD-1.1.*
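
As a sanity check on the table added in this commit, the `average` column can be recomputed from the eight per-task scores. The sketch below assumes the average is an unweighted mean rounded to two decimals; the README does not state the averaging rule, and the names `TASKS`, `SCORES`, and `task_average` are illustrative, not part of the evaluation harness.

```python
# Per-task scores copied from the 8-task table in the diff above.
TASKS = [
    "jcommonsenseqa", "jnli", "marc_ja", "jsquad",
    "jaqket_v2", "xlsum_ja", "xwinograd_ja", "mgsm",
]

SCORES = {
    "weblab-10b-instruction-sft": [74.62, 66.56, 95.49, 78.34, 63.32, 20.57, 71.95, 2.0],
    "weblab-10b": [66.58, 53.74, 82.07, 62.94, 56.19, 10.03, 71.95, 2.4],
}


def task_average(scores):
    """Unweighted mean over all tasks, rounded to two decimals.

    Assumption: the README's "average" column is a plain mean; the
    source does not say how the average was computed.
    """
    if len(scores) != len(TASKS):
        raise ValueError(f"expected {len(TASKS)} scores, got {len(scores)}")
    return round(sum(scores) / len(scores), 2)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {task_average(scores):.2f}")
```

Under that assumption the recomputed values are 59.11 for `weblab-10b-instruction-sft` and 50.74 for `weblab-10b`, which agree with the `average` column in the table.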