Update README.md

#### 3.1.1 No Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

| Model               | Score                    | 95% CIs     |
| ------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π | **86.1** (Top-1 among π) | (-1.5, 1.7) |
| …                   | …                        | …           |
| Yi-Large π          | 63.7                     | (-2.6, 2.4) |
| GLM-4-0520 π        | 63.8                     | (-2.9, 2.8) |

**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

| Model                   | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **59.4** | (-2.4, 2.1) |
| Qwen2.5-7B-Instruct π   | 50.4     | (-2.9, 2.5) |
| Gemma-2-27B-IT π        | 57.5     | (-2.1, 2.4) |
| Llama-3.1-8B-Instruct π | 21.3     | (-1.9, 2.2) |
| Llama-3-8B-Instruct π   | 20.6     | (-2.0, 1.9) |
| Starling-LM-7B-beta π   | 23.0     | (-1.8, 1.8) |

#### 3.1.2 Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

| Model               | Score                    | 95% CIs     |
| ------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π | **72.4** (Top-1 among π) | (-4.3, 4.1) |
| …                   | …                        | …           |
| Yi-Large-Preview π  | 65.1                     | (-2.5, 2.5) |
| GLM-4-0520 π        | 61.4                     | (-2.6, 2.4) |

**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

| Model                   | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **50.3** | (-3.8, 2.8) |
| Qwen2.5-7B-Instruct π   | 46.9     | (-3.1, 2.7) |
| Gemma-2-27B-IT π        | 47.5     | (-2.5, 2.7) |
| Llama-3.1-8B-Instruct π | 18.3     | (-1.6, 1.6) |
| Llama-3-8B-Instruct π   | 19.8     | (-1.6, 1.9) |
| Starling-LM-7B-beta π   | 26.1     | (-2.6, 2.0) |
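
The No Style Control and Style Control tables come from the same benchmark; the latter additionally adjusts scores for stylistic factors such as response length and markdown formatting. As a rough illustration of the idea only (not the Arena-Hard implementation), the toy Python sketch below fits a Bradley-Terry-style logistic regression on synthetic pairwise battles with an extra length-difference covariate, so the per-model coefficients approximate style-controlled strengths; every name and number in it is made up.

```python
# Toy sketch of "style control": fit a Bradley-Terry-style logistic regression on pairwise
# battles with an extra covariate for response-length difference, so per-model strengths
# are estimated while controlling for verbosity. Synthetic data; not the Arena-Hard code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
models = ["model_a", "model_b", "model_c"]

# Synthetic battles between two distinct models, with a standardized length difference.
left = rng.integers(0, len(models), 4000)
right = rng.integers(0, len(models), 4000)
keep = left != right
left, right = left[keep], right[keep]
len_diff = rng.normal(0.0, 1.0, left.size)

# Ground truth: real quality differences plus a bias toward longer answers.
true_strength = np.array([1.0, 0.0, -1.0])
logits = true_strength[left] - true_strength[right] + 0.8 * len_diff
left_wins = rng.random(left.size) < 1.0 / (1.0 + np.exp(-logits))

# Design matrix: +1 for the left model, -1 for the right model, plus the style covariate.
X = np.zeros((left.size, len(models) + 1))
X[np.arange(left.size), left] += 1.0
X[np.arange(left.size), right] -= 1.0
X[:, -1] = len_diff

clf = LogisticRegression(fit_intercept=False).fit(X, left_wins)
strengths = dict(zip(models, clf.coef_[0][: len(models)].round(2)))  # style-controlled
print("strengths:", strengths, "| length coefficient:", round(clf.coef_[0][-1], 2))
```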

### 3.2 AlignBench-v1.1

> [!IMPORTANT]
> We replaced AlignBench's original judge model, `GPT-4-0613`, with the more powerful `GPT-4o-0513`. For a fair comparison, every result below was generated with `GPT-4o-0513` as the judge, so these scores may differ from AlignBench-v1.1 results reported elsewhere.
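
To make the judging setup concrete, here is a minimal, hedged sketch of querying `GPT-4o-0513` as a single-answer judge through the OpenAI Python SDK; the same judge swap applies to MT-Bench in Section 3.3. This is not the AlignBench or MT-Bench harness: the prompt wording, the 1-10 scale, and the `judge_score` helper are illustrative assumptions, and `gpt-4o-2024-05-13` is assumed to be the API identifier for `GPT-4o-0513`.

```python
# Illustrative single-answer judging sketch (not the official AlignBench/MT-Bench code).
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
import re

from openai import OpenAI

client = OpenAI()

# Assumption: this is the API identifier corresponding to GPT-4o-0513.
JUDGE_MODEL = "gpt-4o-2024-05-13"


def judge_score(question: str, answer: str) -> float:
    """Ask the judge model to rate one answer on a 1-10 scale and parse the number."""
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the user's question "
        "on a scale from 1 to 10, considering correctness and helpfulness. "
        "Reply with the numeric rating only.\n\n"
        f"[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
    )
    completion = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = completion.choices[0].message.content or ""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else float("nan")


if __name__ == "__main__":
    print(judge_score("What is the capital of France?", "The capital of France is Paris."))
```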

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

| Model               | Score                    |
| ------------------- | ------------------------ |
| **Xwen-72B-Chat** π | **7.57** (Top-1 among π) |
| …                   | …                        |
| Yi-Large-Preview π  | 7.20                     |

**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

| Model              | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **6.88** |
| Qwen2.5-7B-Chat π  | 6.56     |

### 3.3 MT-Bench

> [!IMPORTANT]
> We replaced MT-Bench's original judge model, `GPT-4`, with the more powerful `GPT-4o-0513`. For a fair comparison, every result below was generated with `GPT-4o-0513` as the judge, so these scores may differ from MT-Bench results reported elsewhere.

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

| Model               | Score                    |
| ------------------- | ------------------------ |
| **Xwen-72B-Chat** π | **8.64** (Top-1 among π) |
| …                   | …                        |
| Yi-Lightning π      | **8.75** (Top-1 among π) |
| Yi-Large-Preview π  | 8.32                     |

**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

| Model              | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **7.98** |
| Qwen2.5-7B-Chat π  | 7.71     |

## References