Update README.md
README.md
CHANGED
@@ -96,7 +96,7 @@ for output in outputs:

 # Evaluation

-We evaluated this model for output accuracy and the percentage of valid Japanese `<think>` sections using the first 50 rows of the
+We evaluated this model for output accuracy and the percentage of valid Japanese `<think>` sections using the first 50 rows of the [SakanaAI/gsm8k-ja-test_250-1319](https://huggingface.co/datasets/SakanaAI/gsm8k-ja-test_250-1319) dataset.

 We compare this to the original R1 model and test in both regimes where repetition penalty is 1.0 and 1.1:

@@ -110,7 +110,7 @@ We compare this to the original R1 model and test in both regimes where repetiti
 Code for the SakanaAI/gsm8k-ja-test_250-1319 evaluation can be found [here](https://drive.google.com/file/d/1gCzCJv5vasw8R3KVQimfoIDFyfxwxNvC/view?usp=sharing).


-We further use the first 50 prompts from
+We further use the first 50 prompts from [DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-reasoning-ja) to evaluate the percentage of valid Japanese `\<think\>` sections in model responses.
 This benchmark contains more varied and complex prompts, meaning this is a more realistic evaluation of how reliably this model can output Japanese.

 | | Repetition Penalty | Valid Japanese `<think>` (%) |
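
The README text in this diff reports a "Valid Japanese `<think>` (%)" figure but does not define how validity is decided. The authors' own evaluation code is the linked Google Drive file. The sketch below is not that code. It is a minimal illustration of how such a percentage could be computed, under two assumptions of ours: a response contains a literal `<think>...</think>` pair, and a section counts as valid Japanese when at least half of its letters are kana or kanji. The function names and the `min_ratio` threshold are hypothetical.

```python
import re
from typing import Optional, Sequence

# Assumption: the reasoning is wrapped in a literal <think>...</think> pair.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Hiragana and Katakana (U+3040-U+30FF) plus CJK Unified Ideographs (U+4E00-U+9FFF).
JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def think_section(response: str) -> Optional[str]:
    """Return the text inside the first <think>...</think> pair, or None."""
    match = THINK_RE.search(response)
    return match.group(1).strip() if match else None


def is_valid_japanese_think(response: str, min_ratio: float = 0.5) -> bool:
    """Hypothetical criterion: enough of the section's letters are Japanese."""
    section = think_section(response)
    if not section:
        return False
    # Ignore digits, punctuation, and whitespace when computing the ratio.
    letters = [ch for ch in section if ch.isalpha()]
    if not letters:
        return False
    japanese = sum(1 for ch in letters if JAPANESE_RE.match(ch))
    return japanese / len(letters) >= min_ratio


def valid_japanese_think_rate(responses: Sequence[str]) -> float:
    """Percentage of responses whose <think> section passes the check."""
    if not responses:
        return 0.0
    valid = sum(is_valid_japanese_think(r) for r in responses)
    return 100.0 * valid / len(responses)
```

Calling `valid_japanese_think_rate` on the 50 generated responses for each repetition-penalty setting would give one cell of the table. A stricter or looser `min_ratio` changes the reported figure, so the threshold should be stated alongside any result.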