shenzhi-wang committed
Commit f3d419d · verified · 1 Parent(s): c48a45d

Update README.md

Files changed (1): README.md (+45 -1)
README.md CHANGED
```diff
@@ -95,6 +95,8 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 
 #### 3.1.1 No Style Control
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score | 95% CIs |
 | --------------------------------- | ------------------------ | ----------- |
 | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
@@ -112,8 +114,24 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large 🔒 | 63.7 | (-2.6, 2.4) |
 | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |
 
+
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** 🔑 | **59.4** | (-2.4, 2.1) |
+| Qwen2.5-7B-Instruct 🔑 | 50.4 | (-2.9, 2.5) |
+| Gemma-2-27B-IT 🔑 | 57.5 | (-2.1, 2.4) |
+| Llama-3.1-8B-Instruct 🔑 | 21.3 | (-1.9, 2.2) |
+| Llama-3-8B-Instruct 🔑 | 20.6 | (-2.0, 1.9) |
+| Starling-LM-7B-beta 🔑 | 23.0 | (-1.8, 1.8) |
+
+
+
 #### 3.1.2 Style Control
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score | 95% CIs |
 | --------------------------------- | ------------------------ | ----------- |
 | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 among 🔑) | (-4.3, 4.1) |
@@ -131,12 +149,25 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
 | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** 🔑 | **50.3** | (-3.8, 2.8) |
+| Qwen2.5-7B-Instruct 🔑 | 46.9 | (-3.1, 2.7) |
+| Gemma-2-27B-IT 🔑 | 47.5 | (-2.5, 2.7) |
+| Llama-3.1-8B-Instruct 🔑 | 18.3 | (-1.6, 1.6) |
+| Llama-3-8B-Instruct 🔑 | 19.8 | (-1.6, 1.9) |
+| Starling-LM-7B-beta 🔑 | 26.1 | (-2.6, 2.0) |
+
 
 ### 3.2 AlignBench-v1.1
 
 > [!IMPORTANT]
 > We replaced the original judge model, `GPT-4-0613`, in AlignBench with the more powerful `GPT-4o-0513`. To ensure fairness, all results below are generated by `GPT-4o-0513` and may therefore differ from AlignBench-v1.1 scores reported elsewhere.
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score |
 | ----------------------------- | ------------------------ |
 | **Xwen-72B-Chat** 🔑 | **7.57** (Top-1 among 🔑) |
@@ -151,11 +182,20 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Large-Preview 🔒 | 7.20 |
 
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
+| | Score |
+| ------------------ | -------- |
+| **Xwen-7B-Chat** 🔑 | **6.88** |
+| Qwen2.5-7B-Chat 🔑 | 6.56 |
+
 ### 3.3 MT-Bench
 
 > [!IMPORTANT]
 > We replaced the original judge model, `GPT-4`, in MT-Bench with the more powerful `GPT-4o-0513`. To ensure fairness, all results below are generated by `GPT-4o-0513` and may therefore differ from MT-Bench scores reported elsewhere.
 
+**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
 | | Score |
 | ----------------------------- | ------------------------ |
 | **Xwen-72B-Chat** 🔑 | **8.64** (Top-1 among 🔑) |
@@ -169,8 +209,12 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
 | Yi-Lightning 🔒 | **8.75** (Top-1 among 🔒) |
 | Yi-Large-Preview 🔒 | 8.32 |
 
+**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
 
-
+| | Score |
+| ------------------ | -------- |
+| **Xwen-7B-Chat** 🔑 | **7.98** |
+| Qwen2.5-7B-Chat 🔑 | 7.71 |
 
 ## References
 
```
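A note on reading the Arena-Hard tables in this diff: the `95% CIs` column lists lower and upper offsets around the score, so `86.1` with `(-1.5, 1.7)` denotes an interval of roughly `[84.6, 87.8]`. The sketch below shows how such a bootstrap interval can be computed in principle; the `bootstrap_ci` helper and the per-question `scores` input are illustrative assumptions, not the Arena-Hard pipeline (which bootstraps judged pairwise battles rather than raw scores).

```python
# Bootstrap-CI sketch (illustrative; NOT the Arena-Hard implementation).
# Assumption: `scores` holds one judged score per benchmark question.
import random

def bootstrap_ci(scores: list[float], n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Return (low, high) offsets of a 95% bootstrap CI around the mean score."""
    rng = random.Random(seed)
    mean = sum(scores) / len(scores)
    resampled_means = []
    for _ in range(n_boot):
        sample = [rng.choice(scores) for _ in scores]  # resample with replacement
        resampled_means.append(sum(sample) / len(sample))
    resampled_means.sort()
    low = resampled_means[int(0.025 * n_boot)] - mean   # e.g. -1.5
    high = resampled_means[int(0.975 * n_boot)] - mean  # e.g. +1.7
    return low, high
```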
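The two [!IMPORTANT] notes above describe one methodological change: AlignBench and MT-Bench answers were re-graded with `GPT-4o-0513` as the judge in place of the original GPT-4 judges. A minimal LLM-as-judge sketch of that setup, assuming the OpenAI Python SDK; the prompt, the `judge_answer` helper, and the 1-10 rubric are illustrative, not the benchmarks' actual grading code:

```python
# LLM-as-judge sketch (illustrative; not the AlignBench/MT-Bench grading code).
# Assumption: "gpt-4o-2024-05-13" is the API identifier for `GPT-4o-0513`.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE_MODEL = "gpt-4o-2024-05-13"

def judge_answer(question: str, answer: str) -> str:
    """Ask the judge model to rate one candidate answer on a 1-10 scale."""
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the user's "
        "question on a scale of 1 to 10 and briefly justify the rating.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}"
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep judging as deterministic as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(judge_answer("What is the capital of France?", "Paris."))
```

Pinning one judge model and a fixed sampling temperature across every evaluated model is what the notes mean by keeping the comparison fair.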