Update README.md
README.md CHANGED
```diff
@@ -68,7 +68,7 @@ Performance-wise:
 **TMMLU+**, **DRCD**, and **Table** are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
 [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) is derived from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
 and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
-We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
 
 
 | Models | |↑ TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
```
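The log-likelihood scoring added in this hunk works by appending each answer option to the prompt, summing the log-probabilities the model assigns to the option's tokens, and taking the highest-scoring option as the prediction. A minimal sketch with Hugging Face `transformers` (the prompt format and helper names here are illustrative, not the harness's exact implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0_1")
model = AutoModelForCausalLM.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0_1")
model.eval()

def option_loglikelihood(context: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    # Assumes tokenizing `context + option` splits cleanly at the context
    # boundary; real harness code tokenizes the continuation more carefully.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    # Row i of the shifted log-probs predicts token i+1 of the sequence.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_ids = full[0, ctx_len:]
    rows = range(ctx_len - 1, full.shape[1] - 1)
    return sum(log_probs[r, t].item() for r, t in zip(rows, option_ids))

def pick_answer(context: str, options: list[str]) -> str:
    """Choose the option with the highest total log-likelihood."""
    return max(options, key=lambda o: option_loglikelihood(context, o))
```

For a TMMLU+-style item this would be called as `pick_answer("Question: ...\nAnswer:", [" (A) ...", " (B) ...", ...])`; accuracy (ACC) is then the fraction of items where the picked option is the gold answer.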
```diff
@@ -92,7 +92,7 @@ Performance-wise:
 [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) is derived from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
 and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
 **MT-Bench** is sourced from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
-We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
 We use code adapted from [FastChat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) (GPT-4 as judge) to evaluate **MT-Bench-tw** and **MT-Bench**.
 
 
```
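The llm_judge evaluation referenced above is single-answer grading: GPT-4 is shown the question and the model's reply and asked to return a rating on a 1-10 scale, and ratings are averaged per benchmark. A minimal sketch of that pattern using the `openai` client (the prompt wording is paraphrased rather than FastChat's exact template, and an `OPENAI_API_KEY` environment variable is assumed):

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the AI "
    "assistant's answer to the user question below, on a scale of 1 to 10. "
    'End your reply with the verdict in the form "Rating: [[N]]".\n\n'
    "[Question]\n{question}\n\n[Answer]\n{answer}"
)

def judge_score(question: str, answer: str, judge_model: str = "gpt-4") -> float:
    """Ask the judge model for a 1-10 rating and parse it from the reply."""
    reply = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    ).choices[0].message.content
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else float("nan")
```

The MT-Bench-tw and MT-Bench columns below are averages of such per-answer ratings, which is why they sit on a 1-10 scale while the other columns are percentages.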
```diff
@@ -100,7 +100,7 @@ Performance-wise:
 |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
 | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
 | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
-| [gpt-3.5-turbo](https://openai.com) | |7.1 |
+| [gpt-3.5-turbo](https://openai.com) | |7.1 | 43.56 | | | 45.14 |7.9 | 67.09 | |
 | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
 | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
 | [**Breeze-7B-Instruct-v0_1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0_1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
```
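The 0/3/5-shot labels in this table's header refer to how many solved examples are prepended to each test prompt before the question being scored; 0 shot poses the question directly. A minimal sketch of k-shot prompt construction (the Q/A layout and field names are illustrative):

```python
def build_kshot_prompt(examples: list[dict], question: str, k: int) -> str:
    """Prepend k solved examples, then pose the test question."""
    shots = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples[:k]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])
```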
```diff
@@ -131,7 +131,7 @@ Performance-wise:
 | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
 | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
 | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | 44.79 |
-| gpt-3.5-turbo | 41.
+| gpt-3.5-turbo | 41.58 | 48.52 | 40.96 | 43.18 | 43.56 |
 | **Breeze-7B-Instruct-v0_1** | 37.41 | 46.81 | 42.06 | 40.16 | 41.61 |
 | **Breeze-7B-Instruct-64k-v0_1** | 37.88 | 46.35 | 40.31 | 39.40 | 40.99 |
 | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | 40.02 |
```