k-eval
Collection
The k-eval model is an evaluation model designed to replace GPT-4 evaluation or human evaluation.
keval is an evaluation model trained on the prompt and dataset of a benchmark for evaluating Korean language models, one of several methods that use ChatGPT to evaluate models. It is intended to compensate for the shortcomings of the existing lm-evaluation-harness.
The current model is private.
Now that the new version (keval-9b) has been released, the previous version will be made available for anyone to use.
idx | model | acc | wrong | diff-0 | diff-1 | diff-2 | diff-3 | diff-4 | diff-5 | diff-6 | diff-7 | diff-8 | diff-9 | length |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Meta-Llama-3-8B-Instruct-keval_datasets_small.jsonl | 0.45 | 0.38 | 0.06 | 0.18 | 0.17 | 0.09 | 0.05 | 0.04 | 0.02 | 0 | 0 | 0 | 100 |
1 | Mistral-7B-Instruct-v0.2-keval_datasets_small.jsonl | 0.55 | 0.27 | 0.18 | 0.2 | 0.09 | 0.06 | 0.05 | 0.03 | 0.04 | 0.04 | 0.04 | 0 | 100 |
2 | Mistral-7B-Instruct-v0.3-keval_datasets_small.jsonl | 0.71 | 0.05 | 0.26 | 0.26 | 0.13 | 0.08 | 0.07 | 0.04 | 0.03 | 0.03 | 0.04 | 0 | 100 |
3 | aya-23-8B-keval_datasets_small.jsonl | 0.7 | 0.02 | 0.17 | 0.24 | 0.16 | 0.1 | 0.13 | 0.06 | 0.06 | 0.04 | 0.01 | 0 | 100 |
4 | gemma-2-27b-it-keval_datasets_small.jsonl | 0.76 | 0.11 | 0.2 | 0.35 | 0.18 | 0.1 | 0.03 | 0.01 | 0.02 | 0 | 0 | 0 | 100 |
5 | gemma-2-9b-it-keval_datasets_small.jsonl | 0.83 | 0.04 | 0.26 | 0.42 | 0.15 | 0.05 | 0.02 | 0.05 | 0.01 | 0 | 0 | 0 | 100 |
6 | keval-7b-keval_datasets_small.jsonl | 0.84 | 0 | 0.28 | 0.41 | 0.11 | 0.06 | 0.05 | 0.03 | 0.02 | 0.03 | 0.01 | 0 | 100 |
7 | keval-9b-keval_datasets_small.jsonl | 0.91 | 0 | 0.43 | 0.38 | 0.1 | 0.05 | 0.03 | 0.01 | 0 | 0 | 0 | 0 | 100 |
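
The table does not define its columns. In every row, `wrong` plus `diff-0` through `diff-9` adds up to roughly 1, which suggests (as an assumption, not something this page states) that `diff-k` is the share of samples where the judge's score differs from the reference score by exactly k points, and `wrong` is the share of samples where no score could be parsed. Under that reading, a row could be computed with a sketch like the one below. The function name `diff_distribution` and its inputs are hypothetical, and `acc` is left out because its definition cannot be inferred from the table.

```python
from collections import Counter


def diff_distribution(predicted, reference, max_diff=9):
    """Summarize how far judge scores fall from reference scores.

    ASSUMPTION: this mirrors one possible reading of the table columns;
    the actual keval evaluation script is not shown on this page.

    predicted: list of int scores, or None where the judge output
               could not be parsed into a score.
    reference: list of int reference scores, same length as predicted.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    n = len(reference)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Samples without a parseable score are counted as "wrong".
    wrong = sum(1 for p in predicted if p is None)

    # Count absolute score differences for the parseable samples.
    diffs = Counter(
        abs(p - r) for p, r in zip(predicted, reference) if p is not None
    )

    row = {"wrong": wrong / n}
    for k in range(max_diff + 1):
        row[f"diff-{k}"] = diffs.get(k, 0) / n
    row["length"] = n
    return row


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the benchmark.
    print(diff_distribution([7, None, 5, 10], [7, 3, 6, 1]))
```

With this reading, a stronger judge concentrates its mass in `diff-0` and `diff-1`, which matches the pattern in the keval-9b row.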