Announcement: Details on the Open Ko-LLM Leaderboard Season 2 Task

Tasks

We evaluate large language models on the benchmarks below using a private fork of lm-evaluation-harness (maintained by EleutherAI), a framework for few-shot evaluation of language models; a usage sketch follows the task list.

  • Ko-GPQA
    • Korean version of GPQA.
    • GPQA is a highly challenging knowledge dataset with questions crafted by PhD-level domain experts in fields like biology, physics, and chemistry.
  • Ko-WinoGrande
    • Korean version of WinoGrande.
    • A large-scale, adversarial Winograd-style benchmark for commonsense reasoning.
  • Ko-GSM8K
    • Korean version of GSM8K.
    • Diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
  • Ko-EQ-Bench
    • Korean version of EQ-Bench.
    • A benchmark designed to evaluate aspects of emotional intelligence by asking models to predict the intensity of emotional states of characters in a dialogue.
  • Ko-IFEval
    • Korean version of IFEval.
    • IFEval is a dataset designed to test a model’s ability to follow explicit instructions, such as “include keyword x” or “use format y.”
  • KorNAT-Knowledge
    • Research paper: https://arxiv.org/abs/2402.13605
    • Common knowledge refers to information broadly recognized and understood by the populace, often considered basic knowledge. The questions are based on Korea's compulsory education curriculum.
  • KorNAT-Social-Value
    • Research paper: https://arxiv.org/abs/2402.13605
    • Social values refer to the collective viewpoints of a nation's citizens on issues critical to their society. The questions are built from keywords extracted from monthly social conflict reports and the past 12 months of news articles. All questions have undergone two rounds of human revision to ensure high quality and elaborateness.
  • Ko-Harmlessness
    • Research paper: https://arxiv.org/abs/2402.13605
    • Ko-Harmlessness is a test designed to evaluate how harmless an LLM is in potentially socially harmful areas. It assesses the model’s ability to select appropriate responses to queries in the domains of Bias, Hate, Sensitiveness, and Illegal content, using a multiple-choice format.
  • Ko-Helpfulness
    • Research paper: https://arxiv.org/abs/2402.13605
    • Ko-Helpfulness is a test designed to evaluate how well an LLM can judge the helpfulness of queries in accordance with user intent. It categorizes questions into Clarification, where specific information needs to be re-asked, and Nonsense, where the query is impossible to answer, and assesses the model's ability to select appropriate responses using a multiple-choice format.
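
For reference, here is a minimal sketch of how an evaluation like this could be launched with the public lm-evaluation-harness. The leaderboard itself runs a private fork, so the backend, task names, and model name below are illustrative assumptions rather than the fork's actual configuration.

```python
# Illustrative only: the leaderboard uses a private fork of lm-evaluation-harness,
# and the task identifiers and model name below are placeholders, not the fork's
# actual task names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",
    tasks=["gsm8k", "winogrande"],             # public stand-ins for Ko-GSM8K / Ko-WinoGrande
    num_fewshot=5,                             # matches the 5-shot settings listed below
    batch_size=8,
)
print(results["results"])                      # per-task metrics such as acc, acc_norm, exact_match
```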

Task Evaluations and Parameters

Ko-GPQA:

  • Dataset: Ko-GPQA-Diamond
  • Task type: Multiple-choice
  • Few-shot setting: 0-shot
  • Evaluation: acc_norm
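
Several of the multiple-choice tasks report acc_norm. As a rough sketch of the convention used by the public lm-evaluation-harness (the private fork may differ), acc_norm selects the answer choice with the highest log-likelihood after normalizing by the choice's byte length:

```python
# Sketch of acc_norm-style scoring for a single multiple-choice item, following the
# public lm-evaluation-harness convention of byte-length-normalized log-likelihood.
def pick_choice(loglikelihoods, choices):
    # loglikelihoods[i] = log P(choice_i | question) under the model being evaluated
    normed = [ll / len(c.encode("utf-8")) for ll, c in zip(loglikelihoods, choices)]
    return max(range(len(choices)), key=lambda i: normed[i])

# acc_norm over the dataset is the fraction of items where pick_choice(...)
# returns the index of the gold answer.
```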

Ko-WinoGrande:

  • Task type: Multiple-choice
  • Few-shot setting: 5-shot
  • Evaluation: acc

Ko-GSM8K:

  • Task type: Generation
  • Few-shot setting: 5-shot
  • Evaluation: exact-match (strict-match)
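
Ko-GSM8K is scored with exact-match under a strict-match filter: the final answer is extracted from the generation and compared verbatim to the gold answer. Below is a simplified sketch of this style of scoring, assuming the GSM8K "#### answer" convention; the private fork's actual regex and answer format, especially for Korean prompts, may differ.

```python
import re

# Simplified strict-match scoring sketch: extract the number after the "####" marker
# (the GSM8K answer convention) and compare it exactly to the gold answer.
# The leaderboard's private fork may use a different marker or regex for Korean data.
ANSWER_RE = re.compile(r"####\s*(-?[\d.,]+)")

def strict_exact_match(generation: str, gold: str) -> bool:
    m = ANSWER_RE.search(generation)
    if m is None:
        return False                      # an unparsable answer counts as incorrect
    return m.group(1).replace(",", "") == gold.replace(",", "")
```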

Ko-EQ-Bench:

  • Task type: Generation
  • Few-shot setting: 0-shot
  • Evaluation: eqbench (details in EQ-Bench)
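
The eqbench metric compares the model's predicted emotion intensities against reference intensities for each dialogue. The snippet below is only a simplified illustration of that idea; the official EQ-Bench scorer adds answer parsing, rescaling, and normalization on top of this.

```python
# Simplified illustration of EQ-Bench-style scoring: each item asks for 0-10 intensity
# ratings of several emotions, and a prediction is better the closer its ratings are
# to the reference ratings. The real eqbench metric applies further normalization.
def item_distance(predicted: dict, reference: dict) -> float:
    return sum(abs(predicted[emotion] - reference[emotion]) for emotion in reference)

pred = {"anger": 6, "relief": 1, "surprise": 7, "sadness": 2}   # hypothetical model output
ref  = {"anger": 7, "relief": 0, "surprise": 8, "sadness": 1}   # hypothetical reference ratings
print(item_distance(pred, ref))   # prints 4; smaller distance means better intensity prediction
```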

Ko-IFEval:

  • Task type: Generation
  • Few-shot setting: 0-shot
  • Evaluation: inst_level_strict_acc and prompt_level_strict_acc (details in IFEval)
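
The two Ko-IFEval metrics differ only in granularity: prompt_level_strict_acc counts a prompt as correct only if every instruction attached to it is followed, while inst_level_strict_acc averages over individual instructions. A small sketch with made-up per-instruction verdicts:

```python
# Sketch of the two IFEval accuracy granularities. Each inner list holds strict
# pass/fail verdicts for the instructions attached to one prompt (values are
# invented for illustration).
results_per_prompt = [
    [True, True],          # prompt 1: both instructions followed
    [True, False, True],   # prompt 2: one instruction violated
    [False],               # prompt 3: the only instruction violated
]

inst_level_strict_acc = sum(sum(r) for r in results_per_prompt) / sum(
    len(r) for r in results_per_prompt
)  # 4/6: fraction of individual instructions followed

prompt_level_strict_acc = sum(all(r) for r in results_per_prompt) / len(
    results_per_prompt
)  # 1/3: fraction of prompts where every instruction was followed
```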

KorNAT-Knowledge:

  • Task type: Multiple-choice
  • Few-shot setting: 0-shot
  • Evaluation: acc_norm

KorNAT-Social-Value:

  • Task type: Multiple-choice
  • Few-shot setting: 0-shot
  • Evaluation: A-SVA (details in KorNAT)

Ko-Harmlessness:

  • Task type: Multiple-choice
  • Few-shot setting: 0-shot
  • Evaluation: acc_norm

Ko-Helpfulness:

  • Task type: Multiple-choice
  • Few-shot setting: 0-shot
  • Evaluation: acc_norm
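
For convenience, the settings above can be collected in one place. The mapping below simply restates the list; the keys are illustrative names, not the private fork's actual task identifiers.

```python
# Season 2 evaluation settings as announced above. Keys are illustrative, not the
# private fork's real task identifiers.
SEASON2_TASKS = {
    "ko_gpqa_diamond":     {"type": "multiple-choice", "fewshot": 0, "metric": "acc_norm"},
    "ko_winogrande":       {"type": "multiple-choice", "fewshot": 5, "metric": "acc"},
    "ko_gsm8k":            {"type": "generation",      "fewshot": 5, "metric": "exact_match (strict-match)"},
    "ko_eq_bench":         {"type": "generation",      "fewshot": 0, "metric": "eqbench"},
    "ko_ifeval":           {"type": "generation",      "fewshot": 0, "metric": "inst/prompt_level_strict_acc"},
    "kornat_knowledge":    {"type": "multiple-choice", "fewshot": 0, "metric": "acc_norm"},
    "kornat_social_value": {"type": "multiple-choice", "fewshot": 0, "metric": "A-SVA"},
    "ko_harmlessness":     {"type": "multiple-choice", "fewshot": 0, "metric": "acc_norm"},
    "ko_helpfulness":      {"type": "multiple-choice", "fewshot": 0, "metric": "acc_norm"},
}
```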