ArkaAbacus committed on
Commit
a4e1166
1 Parent(s): 670b7fa

Update README.md

Files changed (1)
  1. README.md +9 -1
README.md CHANGED
@@ -13,7 +13,7 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evalu
 
  | Model | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
  |----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
- | Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
+ | **Smaug-Qwen2-72B-Instruct** | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
  | Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |
 
  #### Breakdown:
@@ -84,6 +84,14 @@ Qwen2-72B-Instruct:
  | - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
  | - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
 
+ ## LiveCodeBench
+
+ | Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
+ |--------------------------|--------|-------------|---------------|-------------|
+ | **Smaug-Qwen2-72B-Instruct** | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
+ | Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |
+
+
  ## Arena-Hard
 
  Score vs selected others (sourced from: https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, and we produced those numbers from a local run using the same methodology.
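
Pass@1 in the LiveCodeBench table above is the pass@k metric with k = 1. For reference only (this is not code from this commit or repository), the commonly used unbiased estimator from Chen et al. (2021) can be sketched as follows, assuming n sampled completions per problem of which c pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass all tests
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, e.g. 3 passing out of 10 samples -> 0.30.
print(pass_at_k(n=10, c=3, k=1))
```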