ArkaAbacus committed on
Commit
a4e1166
1 Parent(s): 670b7fa

Update README.md

Files changed (1)
  1. README.md +9 -1
README.md CHANGED
@@ -13,7 +13,7 @@ Note: These results are with corrected parsing for BBH from Eleuther's [lm-evalu
 
  | Model | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
  |----------------------------|--------|---------|------------|--------|-------------|--------|---|--------|
- | Smaug-Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
+ | **Smaug-Qwen2-72B-Instruct** | bbh | N/A | get-answer | 3 | exact_match | 0.8241 | ± | 0.0042 |
  | Qwen2-72B-Instruct | bbh | N/A | get-answer | 3 | exact_match | 0.8036 | ± | 0.0044 |
 
  #### Breakdown:
@@ -84,6 +84,14 @@ Qwen2-72B-Instruct:
  | - bbh_cot_fewshot_web_of_lies | 2 | get-answer | 3 | exact_match | 1.0000 | 0.0000 |
  | - bbh_cot_fewshot_word_sorting | 2 | get-answer | 3 | exact_match | 0.6680 | 0.0298 |
 
+ ## LiveCodeBench
+
+ | Model | Pass@1 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
+ |--------------------------|--------|-------------|---------------|-------------|
+ | **Smaug-Qwen2-72B-Instruct** | 0.3357 | 0.7286 | 0.1633 | 0.0000 |
+ | Qwen2-72B-Instruct | 0.3139 | 0.6810 | 0.1531 | 0.0000 |
+
+
  ## Arena-Hard
 
  Score vs selected others (sourced from: https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, and we produced those numbers from a local run using the same methodology.
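
Pass@1 in the LiveCodeBench table above is the pass@k metric with k = 1. For reference only (this is not code from this commit or repository), the commonly used unbiased estimator from Chen et al. (2021) can be sketched as follows, assuming n sampled completions per problem of which c pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass all tests
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, e.g. 3 passing out of 10 samples -> 0.30.
print(pass_at_k(n=10, c=3, k=1))
```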