YC-Chen committed on
Commit 80e1ac8
1 Parent(s): 2c85a13

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -96,7 +96,7 @@ Performance-wise:
  [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
  and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
  **MT-Bench** sources from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
- We use code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+ We use code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**. All multiple-choice problems are scored by selecting the option with the highest log-likelihood.
  We use code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) (GPT-4 as judge) to evaluate **MT-Bench-tw** and **MT-Bench**.

@@ -104,7 +104,7 @@ Performance-wise:
  |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
  | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
  | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
- | [gpt-3.5-turbo](https://openai.com) | |7.1 | 41.76 | | | 40.27 |7.9 | 70.00 | |
+ | [gpt-3.5-turbo](https://openai.com) | |7.1 | 43.56 | | | 45.14 |7.9 | 67.09 | |
  | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
  | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
  | [**Breeze-7B-Instruct-v0_1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0_1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
@@ -135,7 +135,7 @@ Performance-wise:
  | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
  | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
  | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | 44.79 |
- | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | 41.76 |
+ | gpt-3.5-turbo | 41.58 | 48.52 | 40.96 | 43.18 | 43.56 |
  | **Breeze-7B-Instruct-v0_1** | 37.41 | 46.81 | 42.06 | 40.16 | 41.61 |
  | **Breeze-7B-Instruct-64k-v0_1** | 37.88 | 46.35 | 40.31 | 39.40 | 40.99 |
  | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | 40.02 |
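
Note: the revised evaluation line above says the multiple-choice benchmarks (**TMMLU+**, **MMLU**, **Table**) are scored by log-likelihood selection. The snippet below is only a minimal sketch of that idea, not the actual evaluation code (which is revised from EleutherAI/lm-evaluation-harness); the model name, prompt, and helper function are illustrative assumptions.

```python
# Sketch: pick the multiple-choice option whose text the model assigns the
# highest log-likelihood, given the question as context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "MediaTek-Research/Breeze-7B-Instruct-v0_1"  # assumption: any causal LM on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map="auto")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.

    Token counting at the context/continuation boundary is approximate here;
    lm-evaluation-harness handles this more carefully.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                        # (1, seq_len, vocab_size)
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)    # position t predicts token t+1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_cont = full_ids.shape[1] - ctx_ids.shape[1]              # tokens belonging to the continuation
    return token_logprobs[0, -n_cont:].sum().item()

# Hypothetical TC multiple-choice item, for illustration only.
question = ("台灣最高的山是哪一座？\n"
            "(A) 雪山 (B) 玉山 (C) 合歡山 (D) 阿里山\n"
            "答案：")
choices = [" (A)", " (B)", " (C)", " (D)"]
scores = [continuation_logprob(question, c) for c in choices]
print("predicted:", choices[scores.index(max(scores))])
```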
 