Taishi-N324 committed
Commit a0b070d
1 Parent(s): 8c2f17d

Upload README.md

Files changed (1)
  1. README.md +37 -0
README.md CHANGED
@@ -83,6 +83,43 @@ Our Swallow-MS-7b-v0.1 model has undergone continuous pre-training from the Mist
  | japanese-stablelm-base-gamma-7b|7B|0.1823|0.1915|
  | Swallow-MS-7b-v0.1 |7B|0.2305|0.2768|
 
+ ## Evaluation Benchmarks
+
+ ### Japanese evaluation benchmarks
+
+ We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:
+
+ - Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
+ - Open-ended question answering (JEMHopQA [Ishii+, 2023])
+ - Open-ended question answering (NIILC [Sekine, 2003])
+ - Machine reading comprehension (JSQuAD [Kurihara+, 2022])
+ - Automatic summarization (XL-Sum [Hasan+, 2021])
+ - Machine translation (WMT2020 ja-en [Barrault+, 2020])
+ - Machine translation (WMT2020 en-ja [Barrault+, 2020])
+ - Mathematics (MGSM [Shi+, 2023])
+
+ Notably, we excluded natural language inference (NLI), even though it is often used to benchmark large language models: the models tended to predict labels in a biased manner, so scores could be inflated whenever those biased predictions happened to match the correct answers. Because this made the results unstable (especially at the 7B scale), NLI is not part of this round of evaluation benchmarks.
+
+ ### English evaluation benchmarks
+
+ We used the Language Model Evaluation Harness (v0.3.0). The details are as follows:
+
+ - Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
+ - Open-ended question answering (TriviaQA [Joshi+, 2017])
+ - Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
+ - Common sense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
+ - Natural language inference (HellaSwag [Zellers+, 2019])
+ - Mathematics (GSM8k [Cobbe+, 2021])
+
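For orientation, the sketch below shows how an evaluation of this kind can be driven through the Python API of EleutherAI's Language Model Evaluation Harness; the JP Language Model Evaluation Harness used above is a fork of the same codebase, so the pattern is analogous with Japanese task names. The repo id, backend name, task names, few-shot count, and batch size here are illustrative assumptions, not the configuration behind the scores in this card.

```python
# Illustrative sketch only: running a subset of the English benchmarks with
# lm-evaluation-harness (~v0.3.0). Backend and task names can differ between
# harness versions; the few-shot count and batch size are placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                         # Hugging Face causal-LM backend
    model_args="pretrained=tokyotech-llm/Swallow-MS-7b-v0.1",  # assumed repo id
    tasks=["hellaswag", "openbookqa", "triviaqa", "gsm8k"],    # subset of the benchmarks above
    num_fewshot=0,                                             # placeholder; the card's shots may differ
    batch_size=8,
)
print(results["results"])
```

Harness scores depend heavily on the few-shot settings and prompt formats, so numbers are only comparable when those match the card's setup.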
+ ### Code evaluation benchmarks
+
+ We utilized the Code Generation LM Evaluation Harness [Allal+, GitHub22] (commit #0261c52). The details are as follows:
+
+ - Python code generation (HumanEval [Allal+, GitHub22])
+ - Python code generation (JHumanEval [佐藤+, ANLP24])
+
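HumanEval-style code benchmarks are conventionally scored with pass@k. As a worked example of that metric (not code from this repository), the standard unbiased estimator of Chen et al. (2021) is sketched below.

```python
# Illustrative sketch: unbiased pass@k estimator for HumanEval-style benchmarks.
# n = samples generated per problem, c = samples that pass all unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 3 passing -> estimated pass@1 of 0.15
print(pass_at_k(n=20, c=3, k=1))
```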
  ## Usage
 
  First install additional dependencies in [requirements.txt](./requirements.txt):