stjohn2007 committed
Commit 9786992
1 Parent(s): 4eac403

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -32,7 +32,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 
 ### MT-Bench JA
 
-TODO
+* We will add the scores of existing models soon.
+
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
 
 ## Base Model Performance
 
@@ -52,6 +56,16 @@ We used llm-jp-eval(v1.0.0) and JP Language Model Evaluation Harness(commit #9b4
 - Machine translation (WMT2020 en-ja [Barrault+, 2020])
 - Mathematical reasoning (MGSM [Shi+, 2023])
 
+### MT-Bench JA
+
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
+We utilized the following artifacts:
+
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
+- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+
 ### English evaluation benchmarks
 
 We used the Language Model Evaluation Harness(v.0.3.0). The details are as follows:
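
The MT-Bench JA hunk above identifies its evaluation inputs only by their W&B artifact pages. As a minimal sketch (an editor's illustration, not part of the committed README), the snippet below fetches those three artifacts with the public `wandb` Python API; the artifact names and versions are read off the URLs in the diff, and the download location is whatever `wandb` chooses by default.

```python
# Sketch: download the Japanese MT-Bench artifacts referenced in the README diff.
# Artifact identifiers follow the W&B convention entity/project/name:version,
# derived here from the wandb.ai URLs above (an assumption, not a committed script).
import wandb

api = wandb.Api()
for artifact_id in (
    "wandb-japan/llm-leaderboard/mtbench_ja_question:v3",         # Question set
    "wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v1",  # Reference answers
    "wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1",           # Judge prompts
):
    artifact = api.artifact(artifact_id)  # look up the artifact on wandb.ai
    local_dir = artifact.download()       # download its files to a local directory
    print(f"{artifact_id} -> {local_dir}")
```

FastChat (pinned to commit #e86e70d0 in the diff) is the implementation that these question, reference-answer, and judge-prompt files feed into.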
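
The hunk also cites the Language Model Evaluation Harness v0.3.0 for the English benchmarks, though the benchmark details themselves fall outside this hunk. Purely as an illustration of what a v0.3.0 run looks like, with a placeholder model id and placeholder task names rather than the README's actual configuration:

```python
# Illustrative only: a Language Model Evaluation Harness v0.3.0 run from Python.
# The model id and task list are placeholders, not the README's benchmark setup.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # Hugging Face causal-LM backend name in v0.3.0
    model_args="pretrained=tokyotech-llm/Swallow-MS-7b-instruct-v0.1",  # assumed repo id
    tasks=["hellaswag", "triviaqa"],  # placeholder tasks
    num_fewshot=0,
)
print(results["results"])  # per-task metrics
```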