YC-Chen committed on
Commit 80e1ac8
1 Parent(s): 2c85a13

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -96,7 +96,7 @@ Performance-wise:
  [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
  and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
  **MT-Bench** sources from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
- We use code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+ We use code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
  We use code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) (GPT-4 as judge) to evaluate **MT-Bench-tw** and **MT-Bench**.

@@ -104,7 +104,7 @@ Performance-wise:
  |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
  | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
  | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
- | [gpt-3.5-turbo](https://openai.com) | |7.1 | 41.76 | | | 40.27 |7.9 | 70.00 | |
+ | [gpt-3.5-turbo](https://openai.com) | |7.1 | 43.56 | | | 45.14 |7.9 | 67.09 | |
  | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
  | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
  | [**Breeze-7B-Instruct-v0_1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0_1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
@@ -135,7 +135,7 @@ Performance-wise:
  | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
  | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
  | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | 44.79 |
- | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | 41.76 |
+ | gpt-3.5-turbo | 41.58 | 48.52 | 40.96 | 43.18 | 43.56 |
  | **Breeze-7B-Instruct-v0_1** | 37.41 | 46.81 | 42.06 | 40.16 | 41.61 |
  | **Breeze-7B-Instruct-64k-v0_1** | 37.88 | 46.35 | 40.31 | 39.40 | 40.99 |
  | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | 40.02 |
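
Note: the revised evaluation line above says the multiple-choice benchmarks (**TMMLU+**, **MMLU**, **Table**) are scored by log-likelihood selection. The snippet below is only a minimal sketch of that idea, not the actual evaluation code (which is revised from EleutherAI/lm-evaluation-harness); the model name, prompt, and helper function are illustrative assumptions.

```python
# Sketch: pick the multiple-choice option whose text the model assigns the
# highest log-likelihood, given the question as context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "MediaTek-Research/Breeze-7B-Instruct-v0_1"  # assumption: any causal LM on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.

    Token counting at the context/continuation boundary is approximate here;
    lm-evaluation-harness handles this more carefully.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                        # (1, seq_len, vocab_size)
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)    # position t predicts token t+1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_cont = full_ids.shape[1] - ctx_ids.shape[1]              # tokens belonging to the continuation
    return token_logprobs[0, -n_cont:].sum().item()

# Hypothetical TC multiple-choice item, for illustration only.
question = ("台灣最高的山是哪一座？\n"
            "(A) 雪山 (B) 玉山 (C) 合歡山 (D) 阿里山\n"
            "答案：")
choices = [" (A)", " (B)", " (C)", " (D)"]
scores = [continuation_logprob(question, c) for c in choices]
print("predicted:", choices[scores.index(max(scores))])
```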
 