victormiller
committed on
Commit • 400af6c
1 Parent(s): 4857ea8
Update README.md
README.md
CHANGED
@@ -34,9 +34,18 @@ Evaluations include standard best practice benchmarks, medical, math, and coding
 
 <center><img src="k2_table_of_tables.png" alt="k2 big eval table"/></center>
 
-
 Detailed analysis can be found on the K2 Weights and Biases project [here](https://wandb.ai/llm360/K2?nw=29mu6l0zzqq)
 
+## Open LLM Leaderboard
+| Evaluation | Score | Raw Score |
+| ----------- | ----------- | ----------- |
+| IFEval | 22.52 | 23 |
+| BBH | 28.22 | 50 |
+| Math Lvl 5 | 2.04 | 2 |
+| GPQA | 3.58 | 28 |
+| MUSR | 8.55 | 40 |
+| MMLU-PRO | 22.27 | 30 |
+| Average | 14.53 | 35.17 |
 
 ## K2 Gallery
 The K2 gallery allows one to browse the output of various prompts on intermediate K2 checkpoints, which provides an intuitive understanding on how the model develops and improves over time. This is inspired by The Bloom Book.