stjohn2007 committed
Commit 9786992
Parent(s): 4eac403
Update README.md
README.md
CHANGED
@@ -32,7 +32,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
 
 ### MT-Bench JA
 
-
+* We will add the scores of existing models soon.
+
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
 
 ## Base Model Performance
 
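The Average column in the added table is the unweighted mean of the eight per-category scores; a quick Python sketch confirms the arithmetic:

```python
# Verify that the reported Average is the mean of the eight category scores.
scores = [0.3770, 0.4290, 0.3454, 0.1040, 0.2400, 0.3677, 0.3907, 0.4750]
average = sum(scores) / len(scores)
print(round(average, 4))  # -> 0.3411, matching the Average column
```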
@@ -52,6 +56,16 @@ We used llm-jp-eval(v1.0.0) and JP Language Model Evaluation Harness(commit #9b4
 - Machine translation (WMT2020 en-ja [Barrault+, 2020])
 - Mathematical reasoning (MGSM [Shi+, 2023])
 
+### MT-Bench JA
+
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
+We utilized the following artifacts:
+
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
+- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+
 ### English evaluation benchmarks
 
 We used the Language Model Evaluation Harness (v0.3.0). The details are as follows:
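The Japanese MT-Bench artifacts added above are hosted on Weights & Biases. A minimal sketch for fetching them with the wandb Python client; the `entity/project/name:version` strings are read off the artifact URLs and are an assumption, as is a prior `wandb login`:

```python
import wandb

# Download the three Japanese MT-Bench dataset artifacts listed in the diff.
api = wandb.Api()
for name in [
    "wandb-japan/llm-leaderboard/mtbench_ja_question:v3",         # assumed path, from the URL
    "wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v1",  # assumed path, from the URL
    "wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1",           # assumed path, from the URL
]:
    artifact = api.artifact(name)    # "<entity>/<project>/<artifact>:<version>"
    local_dir = artifact.download()  # saved under ./artifacts/ by default
    print(name, "->", local_dir)
```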