shenzhi-wang committed
Commit f3d419d · verified · 1 Parent(s): c48a45d

Update README.md

Files changed (1): README.md (+45 -1)
README.md CHANGED
```diff
@@ -95,6 +95,8 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 
 #### 3.1.1 No Style Control
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score | 95% CIs |
 | --------------------------------- | ------------------------ | ----------- |
 | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
@@ -112,8 +114,24 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large 🔒 | 63.7 | (-2.6, 2.4) |
 | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |
 
+
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** 🔑 | **59.4** | (-2.4, 2.1) |
+| Qwen2.5-7B-Instruct 🔑 | 50.4 | (-2.9, 2.5) |
+| Gemma-2-27B-IT 🔑 | 57.5 | (-2.1, 2.4) |
+| Llama-3.1-8B-Instruct 🔑 | 21.3 | (-1.9, 2.2) |
+| Llama-3-8B-Instruct 🔑 | 20.6 | (-2.0, 1.9) |
+| Starling-LM-7B-beta 🔑 | 23.0 | (-1.8, 1.8) |
+
+
+
 #### 3.1.2 Style Control
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score | 95% CIs |
 | --------------------------------- | ------------------------ | ----------- |
 | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 among 🔑) | (-4.3, 4.1) |
@@ -131,12 +149,25 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
 | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** 🔑 | **50.3** | (-3.8, 2.8) |
+| Qwen2.5-7B-Instruct 🔑 | 46.9 | (-3.1, 2.7) |
+| Gemma-2-27B-IT 🔑 | 47.5 | (-2.5, 2.7) |
+| Llama-3.1-8B-Instruct 🔑 | 18.3 | (-1.6, 1.6) |
+| Llama-3-8B-Instruct 🔑 | 19.8 | (-1.6, 1.9) |
+| Starling-LM-7B-beta 🔑 | 26.1 | (-2.6, 2.0) |
+
 
 ### 3.2 AlignBench-v1.1
 
 > [!IMPORTANT]
 > We replaced the original judge model, `GPT-4-0613`, in AlignBench with the more powerful `GPT-4o-0513`. To ensure fairness, all results below are generated by `GPT-4o-0513` and may therefore differ from AlignBench-v1.1 scores reported elsewhere.
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score |
 | ----------------------------- | ------------------------ |
 | **Xwen-72B-Chat** 🔑 | **7.57** (Top-1 among 🔑) |
@@ -151,11 +182,20 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large-Preview 🔒 | 7.20 |
 
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score |
+| ------------------ | -------- |
+| **Xwen-7B-Chat** 🔑 | **6.88** |
+| Qwen2.5-7B-Chat 🔑 | 6.56 |
+
 ### 3.3 MT-Bench
 
 > [!IMPORTANT]
 > We replaced the original judge model, `GPT-4`, in MT-Bench with the more powerful `GPT-4o-0513`. To ensure fairness, all results below are generated by `GPT-4o-0513` and may therefore differ from MT-Bench scores reported elsewhere.
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score |
 | ----------------------------- | ------------------------ |
 | **Xwen-72B-Chat** 🔑 | **8.64** (Top-1 among 🔑) |
@@ -169,8 +209,12 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Lightning 🔒 | **8.75** (Top-1 among 🔒) |
 | Yi-Large-Preview 🔒 | 8.32 |
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
 
-
+| | Score |
+| ------------------ | -------- |
+| **Xwen-7B-Chat** 🔑 | **7.98** |
+| Qwen2.5-7B-Chat 🔑 | 7.71 |
 
 ## References
 
```
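A note on reading the Arena-Hard tables in this diff: the `95% CIs` column lists lower and upper offsets around the score, so `86.1` with `(-1.5, 1.7)` denotes an interval of roughly `[84.6, 87.8]`. The sketch below shows how such a bootstrap interval can be computed in principle; the `bootstrap_ci` helper and the per-question `scores` input are illustrative assumptions, not the Arena-Hard pipeline (which bootstraps judged pairwise battles rather than raw scores).

```python
# Bootstrap-CI sketch (illustrative; NOT the Arena-Hard implementation).
# Assumption: `scores` holds one judged score per benchmark question.
import random

def bootstrap_ci(scores: list[float], n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Return (low, high) offsets of a 95% bootstrap CI around the mean score."""
    rng = random.Random(seed)
    mean = sum(scores) / len(scores)
    resampled_means = []
    for _ in range(n_boot):
        sample = [rng.choice(scores) for _ in scores]  # resample with replacement
        resampled_means.append(sum(sample) / len(sample))
    resampled_means.sort()
    low = resampled_means[int(0.025 * n_boot)] - mean   # e.g. -1.5
    high = resampled_means[int(0.975 * n_boot)] - mean  # e.g. +1.7
    return low, high
```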
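The two [!IMPORTANT] notes above describe one methodological change: AlignBench and MT-Bench answers were re-graded with `GPT-4o-0513` as the judge in place of the original GPT-4 judges. A minimal LLM-as-judge sketch of that setup, assuming the OpenAI Python SDK; the prompt, the `judge_answer` helper, and the 1-10 rubric are illustrative, not the benchmarks' actual grading code:

```python
# LLM-as-judge sketch (illustrative; not the AlignBench/MT-Bench grading code).
# Assumption: "gpt-4o-2024-05-13" is the API identifier for `GPT-4o-0513`.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE_MODEL = "gpt-4o-2024-05-13"

def judge_answer(question: str, answer: str) -> str:
    """Ask the judge model to rate one candidate answer on a 1-10 scale."""
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the user's "
        "question on a scale of 1 to 10 and briefly justify the rating.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}"
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep judging as deterministic as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(judge_answer("What is the capital of France?", "Paris."))
```

Pinning one judge model and a fixed sampling temperature across every evaluated model is what the notes mean by keeping the comparison fair.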