Taishi-N324 committed
Commit a0b070d
1 Parent(s): 8c2f17d

Upload README.md

Files changed (1)
  1. README.md +37 -0
README.md CHANGED
@@ -83,6 +83,43 @@ Our Swallow-MS-7b-v0.1 model has undergone continuous pre-training from the Mist
  | japanese-stablelm-base-gamma-7b|7B|0.1823|0.1915|
  | Swallow-MS-7b-v0.1 |7B|0.2305|0.2768|
 
+ ## Evaluation Benchmarks
+
+ ### Japanese evaluation benchmarks
+
+ We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:
+
+ - Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
+ - Open-ended question answering (JEMHopQA [Ishii+, 2023])
+ - Open-ended question answering (NIILC [Sekine, 2003])
+ - Machine reading comprehension (JSQuAD [Kurihara+, 2022])
+ - Automatic summarization (XL-Sum [Hasan+, 2021])
+ - Machine translation (WMT2020 ja-en [Barrault+, 2020])
+ - Machine translation (WMT2020 en-ja [Barrault+, 2020])
+ - Mathematics (MGSM [Shi+, 2023])
+
+ Notably, we excluded natural language inference (NLI), even though it is often used to benchmark large language models: the models tended to predict labels in a biased manner, so scores could be inflated whenever those biased predictions happened to match the correct answers. Because this made the results unstable (especially at the 7B scale), NLI is not part of this round of evaluation benchmarks.
+
+ ### English evaluation benchmarks
+
+ We used the Language Model Evaluation Harness (v0.3.0). The details are as follows:
+
+ - Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
+ - Open-ended question answering (TriviaQA [Joshi+, 2017])
+ - Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
+ - Common sense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
+ - Natural language inference (HellaSwag [Zellers+, 2019])
+ - Mathematics (GSM8k [Cobbe+, 2021])
+
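For orientation, the sketch below shows how an evaluation of this kind can be driven through the Python API of EleutherAI's Language Model Evaluation Harness; the JP Language Model Evaluation Harness used above is a fork of the same codebase, so the pattern is analogous with Japanese task names. The repo id, backend name, task names, few-shot count, and batch size here are illustrative assumptions, not the configuration behind the scores in this card.

```python
# Illustrative sketch only: running a subset of the English benchmarks with
# lm-evaluation-harness (~v0.3.0). Backend and task names can differ between
# harness versions; the few-shot count and batch size are placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                         # Hugging Face causal-LM backend
    model_args="pretrained=tokyotech-llm/Swallow-MS-7b-v0.1",  # assumed repo id
    tasks=["hellaswag", "openbookqa", "triviaqa", "gsm8k"],    # subset of the benchmarks above
    num_fewshot=0,                                             # placeholder; the card's shots may differ
    batch_size=8,
)
print(results["results"])
```

Harness scores depend heavily on the few-shot settings and prompt formats, so numbers are only comparable when those match the card's setup.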
+ ### Code evaluation benchmarks
+
+ We utilized the Code Generation LM Evaluation Harness [Allal+, GitHub22] (commit #0261c52). The details are as follows:
+
+ - Python code generation (HumanEval [Allal+, GitHub22])
+ - Python code generation (JHumanEval [佐藤+, ANLP24])
+
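HumanEval-style code benchmarks are conventionally scored with pass@k. As a worked example of that metric (not code from this repository), the standard unbiased estimator of Chen et al. (2021) is sketched below.

```python
# Illustrative sketch: unbiased pass@k estimator for HumanEval-style benchmarks.
# n = samples generated per problem, c = samples that pass all unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 3 passing -> estimated pass@1 of 0.15
print(pass_at_k(n=20, c=3, k=1))
```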
  ## Usage
 
  First install additional dependencies in [requirements.txt](./requirements.txt):