stjohn2007 committed
Commit 9786992
1 Parent(s): 4eac403

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -32,7 +32,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 
 ### MT-Bench JA
 
-TODO
+* We will add the scores of existing models soon.
+
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
 
 ## Base Model Performance
 
@@ -52,6 +56,16 @@ We used llm-jp-eval(v1.0.0) and JP Language Model Evaluation Harness(commit #9b4
 - Machine translation (WMT2020 en-ja [Barrault+, 2020])
 - Mathematical reasoning (MGSM [Shi+, 2023])
 
+### MT-Bench JA
+
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
+We utilized the following artifacts:
+
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
+- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+
 ### English evaluation benchmarks
 
 We used the Language Model Evaluation Harness(v.0.3.0). The details are as follows:
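
The MT-Bench JA hunk above identifies its evaluation inputs only by their W&B artifact pages. As a minimal sketch (an editor's illustration, not part of the committed README), the snippet below fetches those three artifacts with the public `wandb` Python API; the artifact names and versions are read off the URLs in the diff, and the download location is whatever `wandb` chooses by default.

```python
# Sketch: download the Japanese MT-Bench artifacts referenced in the README diff.
# Artifact identifiers follow the W&B convention entity/project/name:version,
# derived here from the wandb.ai URLs above (an assumption, not a committed script).
import wandb

api = wandb.Api()
for artifact_id in (
    "wandb-japan/llm-leaderboard/mtbench_ja_question:v3",         # Question set
    "wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v1",  # Reference answers
    "wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1",           # Judge prompts
):
    artifact = api.artifact(artifact_id)  # look up the artifact on wandb.ai
    local_dir = artifact.download()       # download its files to a local directory
    print(f"{artifact_id} -> {local_dir}")
```

FastChat (pinned to commit #e86e70d0 in the diff) is the implementation that these question, reference-answer, and judge-prompt files feed into.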
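
The hunk also cites the Language Model Evaluation Harness v0.3.0 for the English benchmarks, though the benchmark details themselves fall outside this hunk. Purely as an illustration of what a v0.3.0 run looks like, with a placeholder model id and placeholder task names rather than the README's actual configuration:

```python
# Illustrative only: a Language Model Evaluation Harness v0.3.0 run from Python.
# The model id and task list are placeholders, not the README's benchmark setup.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # Hugging Face causal-LM backend name in v0.3.0
    model_args="pretrained=tokyotech-llm/Swallow-MS-7b-instruct-v0.1",  # assumed repo id
    tasks=["hellaswag", "triviaqa"],  # placeholder tasks
    num_fewshot=0,
)
print(results["results"])  # per-task metrics
```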