Update README.md
README.md CHANGED
```diff
@@ -68,7 +68,7 @@ Performance-wise:
 **TMMLU+**, **DRCD**, and **Table** are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
 [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) is derived from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
 and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
-We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
 
 
 | Models | |↑ TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
```
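The log-likelihood scoring added in this hunk works by appending each answer option to the prompt, summing the log-probabilities the model assigns to the option's tokens, and taking the highest-scoring option as the prediction. A minimal sketch with Hugging Face `transformers` (the prompt format and helper names here are illustrative, not the harness's exact implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0_1")
model = AutoModelForCausalLM.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0_1")
model.eval()

def option_loglikelihood(context: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    # Assumes tokenizing `context + option` splits cleanly at the context
    # boundary; real harness code tokenizes the continuation more carefully.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    # Row i of the shifted log-probs predicts token i+1 of the sequence.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_ids = full[0, ctx_len:]
    rows = range(ctx_len - 1, full.shape[1] - 1)
    return sum(log_probs[r, t].item() for r, t in zip(rows, option_ids))

def pick_answer(context: str, options: list[str]) -> str:
    """Choose the option with the highest total log-likelihood."""
    return max(options, key=lambda o: option_loglikelihood(context, o))
```

For a TMMLU+-style item this would be called as `pick_answer("Question: ...\nAnswer:", [" (A) ...", " (B) ...", ...])`; accuracy (ACC) is then the fraction of items where the picked option is the gold answer.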
```diff
@@ -92,7 +92,7 @@ Performance-wise:
 [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) is derived from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
 and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
 **MT-Bench** is sourced from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
-We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+We use code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
 We use code adapted from [FastChat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) (GPT-4 as judge) to evaluate **MT-Bench-tw** and **MT-Bench**.
 
 
```
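The llm_judge evaluation referenced above is single-answer grading: GPT-4 is shown the question and the model's reply and asked to return a rating on a 1-10 scale, and ratings are averaged per benchmark. A minimal sketch of that pattern using the `openai` client (the prompt wording is paraphrased rather than FastChat's exact template, and an `OPENAI_API_KEY` environment variable is assumed):

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the AI "
    "assistant's answer to the user question below, on a scale of 1 to 10. "
    'End your reply with the verdict in the form "Rating: [[N]]".\n\n'
    "[Question]\n{question}\n\n[Answer]\n{answer}"
)

def judge_score(question: str, answer: str, judge_model: str = "gpt-4") -> float:
    """Ask the judge model for a 1-10 rating and parse it from the reply."""
    reply = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    ).choices[0].message.content
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else float("nan")
```

The MT-Bench-tw and MT-Bench columns below are averages of such per-answer ratings, which is why they sit on a 1-10 scale while the other columns are percentages.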
```diff
@@ -100,7 +100,7 @@ Performance-wise:
 |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
 | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
 | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
-| [gpt-3.5-turbo](https://openai.com) | |7.1 |
+| [gpt-3.5-turbo](https://openai.com) | |7.1 | 43.56 | | | 45.14 |7.9 | 67.09 | |
 | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
 | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
 | [**Breeze-7B-Instruct-v0_1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0_1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
```
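The 0/3/5-shot labels in this table's header refer to how many solved examples are prepended to each test prompt before the question being scored; 0 shot poses the question directly. A minimal sketch of k-shot prompt construction (the Q/A layout and field names are illustrative):

```python
def build_kshot_prompt(examples: list[dict], question: str, k: int) -> str:
    """Prepend k solved examples, then pose the test question."""
    shots = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples[:k]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])
```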
```diff
@@ -131,7 +131,7 @@ Performance-wise:
 | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
 | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
 | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | 44.79 |
-| gpt-3.5-turbo | 41.
+| gpt-3.5-turbo | 41.58 | 48.52 | 40.96 | 43.18 | 43.56 |
 | **Breeze-7B-Instruct-v0_1** | 37.41 | 46.81 | 42.06 | 40.16 | 41.61 |
 | **Breeze-7B-Instruct-64k-v0_1** | 37.88 | 46.35 | 40.31 | 39.40 | 40.99 |
 | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | 40.02 |
```