ArkaAbacus committed "Update README.md" (commit a840a3d, parent 2a23d55)

README.md CHANGED:
@@ -28,6 +28,23 @@ The model outperforms Llama-3-70B-Instruct substantially, and is on par with GPT
 
 ## Evaluation
 
+### Arena-Hard
+
+Score vs selected others (sourced from https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge)
+
+| Model | Score | 95% Confidence Interval | Average Tokens |
+| :---- | ---------: | ----------: | ------: |
+| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
+| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
+| **Smaug-Llama-3-70B-Instruct** | 56.7 | (-2.2, 2.6) | 661 |
+| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
+| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
+| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
+| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
+| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
+| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
+| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |
+
 ### MT-Bench
 
 ```
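
The 95% confidence intervals in the Arena-Hard table above are, per the linked LMSYS blog post, obtained by bootstrapping judged battle outcomes. As a rough illustration only (the actual leaderboard bootstraps Bradley-Terry coefficients, not raw win rates, and this sketch is not the model card's or LMSYS's code), a percentile bootstrap over per-battle win/tie/loss scores looks like this; `bootstrap_ci` and the simulated `outcomes` data are hypothetical:

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean win rate.

    `outcomes` holds per-battle scores: 1.0 = win, 0.5 = tie, 0.0 = loss.
    Illustrative sketch only; the Arena-Hard leaderboard bootstraps
    Bradley-Terry model coefficients rather than raw win rates.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n battles with replacement, n_resamples times; sort the means.
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical data: 100 simulated battles with a 57% observed win rate.
outcomes = [1.0] * 57 + [0.0] * 43
low, high = bootstrap_ci(outcomes)
```

The interval `(low, high)` brackets the observed win rate, analogous to reading a table row such as 56.7 with interval (-2.2, 2.6) as a score between 54.5 and 59.3.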