Update README.md
README.md (CHANGED)
---
library_name: transformers
tags: []
---

# Evaluation Results

## Big-Bench Hard (BBH)

Note: These results were obtained with corrected answer parsing for BBH in Eleuther's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); see [this PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/2013).
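As a rough reproduction sketch (not the exact setup used for these numbers): the subtask names in the tables below correspond to the harness's `bbh_cot_fewshot_*` configs, so an evaluation along the following lines should give comparable scores. The model ID, dtype, and the `bbh_cot_fewshot` task-group name are assumptions, and the harness install must include the parsing fix from the PR linked above.

```python
# Minimal sketch, assuming EleutherAI's lm-evaluation-harness is installed from
# source at a revision that includes the BBH parsing fix linked above.
# The model ID and model_args are placeholders, not taken from this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-72B-Instruct,dtype=bfloat16,parallelize=True",
    # Assumed task group: the breakdown tables list bbh_cot_fewshot_* subtasks,
    # whose configs already encode the 3-shot chain-of-thought prompts.
    tasks=["bbh_cot_fewshot"],
    batch_size="auto",
)

# Per-task exact_match under the "get-answer" filter, as reported in the tables.
for task, metrics in results["results"].items():
    print(task, metrics.get("exact_match,get-answer"))
```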
### Smaug-Qwen2-72B-Instruct

#### Overall:

|Groups|Version| Filter |n-shot| Metric | |Value | |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh |N/A |get-answer| 3|exact_match|↑ |0.8241|± |0.0042|

#### Breakdown:

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh |N/A |get-answer| 3|exact_match|↑ |0.8241|± |0.0042|
| - bbh_cot_fewshot_boolean_expressions | 2|get-answer| 3|exact_match|↑ |0.9640|± |0.0118|
| - bbh_cot_fewshot_causal_judgement | 2|get-answer| 3|exact_match|↑ |0.6578|± |0.0348|
| - bbh_cot_fewshot_date_understanding | 2|get-answer| 3|exact_match|↑ |0.8360|± |0.0235|
| - bbh_cot_fewshot_disambiguation_qa | 2|get-answer| 3|exact_match|↑ |0.8280|± |0.0239|
| - bbh_cot_fewshot_dyck_languages | 2|get-answer| 3|exact_match|↑ |0.3360|± |0.0299|
| - bbh_cot_fewshot_formal_fallacies | 2|get-answer| 3|exact_match|↑ |0.7120|± |0.0287|
| - bbh_cot_fewshot_geometric_shapes | 2|get-answer| 3|exact_match|↑ |0.5320|± |0.0316|
| - bbh_cot_fewshot_hyperbaton | 2|get-answer| 3|exact_match|↑ |0.9880|± |0.0069|
| - bbh_cot_fewshot_logical_deduction_five_objects | 2|get-answer| 3|exact_match|↑ |0.7680|± |0.0268|
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2|get-answer| 3|exact_match|↑ |0.5360|± |0.0316|
| - bbh_cot_fewshot_logical_deduction_three_objects | 2|get-answer| 3|exact_match|↑ |0.9720|± |0.0105|
| - bbh_cot_fewshot_movie_recommendation | 2|get-answer| 3|exact_match|↑ |0.8000|± |0.0253|
| - bbh_cot_fewshot_multistep_arithmetic_two | 2|get-answer| 3|exact_match|↑ |0.9720|± |0.0105|
| - bbh_cot_fewshot_navigate | 2|get-answer| 3|exact_match|↑ |0.9640|± |0.0118|
| - bbh_cot_fewshot_object_counting | 2|get-answer| 3|exact_match|↑ |0.9200|± |0.0172|
| - bbh_cot_fewshot_penguins_in_a_table | 2|get-answer| 3|exact_match|↑ |0.8493|± |0.0297|
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2|get-answer| 3|exact_match|↑ |0.7560|± |0.0272|
| - bbh_cot_fewshot_ruin_names | 2|get-answer| 3|exact_match|↑ |0.8520|± |0.0225|
| - bbh_cot_fewshot_salient_translation_error_detection | 2|get-answer| 3|exact_match|↑ |0.5920|± |0.0311|
| - bbh_cot_fewshot_snarks | 2|get-answer| 3|exact_match|↑ |0.9101|± |0.0215|
| - bbh_cot_fewshot_sports_understanding | 2|get-answer| 3|exact_match|↑ |0.9440|± |0.0146|
| - bbh_cot_fewshot_temporal_sequences | 2|get-answer| 3|exact_match|↑ |1.0000|± |0.0000|
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2|get-answer| 3|exact_match|↑ |0.9800|± |0.0089|
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 2|get-answer| 3|exact_match|↑ |0.9560|± |0.0130|
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 2|get-answer| 3|exact_match|↑ |0.9640|± |0.0118|
| - bbh_cot_fewshot_web_of_lies | 2|get-answer| 3|exact_match|↑ |1.0000|± |0.0000|
| - bbh_cot_fewshot_word_sorting | 2|get-answer| 3|exact_match|↑ |0.6560|± |0.0301|

### Qwen2-72B-Instruct

#### Overall:

|Groups|Version| Filter |n-shot| Metric | |Value | |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh |N/A |get-answer| 3|exact_match|↑ |0.8036|± |0.0044|

#### Breakdown:

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh |N/A |get-answer| 3|exact_match|↑ |0.8036|± |0.0044|
| - bbh_cot_fewshot_boolean_expressions | 2|get-answer| 3|exact_match|↑ |0.9640|± |0.0118|
| - bbh_cot_fewshot_causal_judgement | 2|get-answer| 3|exact_match|↑ |0.6684|± |0.0345|
| - bbh_cot_fewshot_date_understanding | 2|get-answer| 3|exact_match|↑ |0.8000|± |0.0253|
| - bbh_cot_fewshot_disambiguation_qa | 2|get-answer| 3|exact_match|↑ |0.8360|± |0.0235|
| - bbh_cot_fewshot_dyck_languages | 2|get-answer| 3|exact_match|↑ |0.3040|± |0.0292|
| - bbh_cot_fewshot_formal_fallacies | 2|get-answer| 3|exact_match|↑ |0.7480|± |0.0275|
| - bbh_cot_fewshot_geometric_shapes | 2|get-answer| 3|exact_match|↑ |0.4960|± |0.0317|
| - bbh_cot_fewshot_hyperbaton | 2|get-answer| 3|exact_match|↑ |0.9440|± |0.0146|
| - bbh_cot_fewshot_logical_deduction_five_objects | 2|get-answer| 3|exact_match|↑ |0.6800|± |0.0296|
| - bbh_cot_fewshot_logical_deduction_seven_objects | 2|get-answer| 3|exact_match|↑ |0.4720|± |0.0316|
| - bbh_cot_fewshot_logical_deduction_three_objects | 2|get-answer| 3|exact_match|↑ |0.9200|± |0.0172|
| - bbh_cot_fewshot_movie_recommendation | 2|get-answer| 3|exact_match|↑ |0.7800|± |0.0263|
| - bbh_cot_fewshot_multistep_arithmetic_two | 2|get-answer| 3|exact_match|↑ |0.9760|± |0.0097|
| - bbh_cot_fewshot_navigate | 2|get-answer| 3|exact_match|↑ |0.9520|± |0.0135|
| - bbh_cot_fewshot_object_counting | 2|get-answer| 3|exact_match|↑ |0.9480|± |0.0141|
| - bbh_cot_fewshot_penguins_in_a_table | 2|get-answer| 3|exact_match|↑ |0.5753|± |0.0410|
| - bbh_cot_fewshot_reasoning_about_colored_objects | 2|get-answer| 3|exact_match|↑ |0.8120|± |0.0248|
| - bbh_cot_fewshot_ruin_names | 2|get-answer| 3|exact_match|↑ |0.8760|± |0.0209|
| - bbh_cot_fewshot_salient_translation_error_detection | 2|get-answer| 3|exact_match|↑ |0.5880|± |0.0312|
| - bbh_cot_fewshot_snarks | 2|get-answer| 3|exact_match|↑ |0.8764|± |0.0247|
| - bbh_cot_fewshot_sports_understanding | 2|get-answer| 3|exact_match|↑ |0.9080|± |0.0183|
| - bbh_cot_fewshot_temporal_sequences | 2|get-answer| 3|exact_match|↑ |0.9960|± |0.0040|
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 2|get-answer| 3|exact_match|↑ |0.9160|± |0.0176|
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 2|get-answer| 3|exact_match|↑ |0.9400|± |0.0151|
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 2|get-answer| 3|exact_match|↑ |0.9440|± |0.0146|
| - bbh_cot_fewshot_web_of_lies | 2|get-answer| 3|exact_match|↑ |1.0000|± |0.0000|
| - bbh_cot_fewshot_word_sorting | 2|get-answer| 3|exact_match|↑ |0.6680|± |0.0298|

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->
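The card metadata lists `library_name: transformers`, so the checkpoint can presumably be loaded with the standard `transformers` causal-LM API. A minimal, hypothetical usage sketch follows; the repository ID is a placeholder, since the card template does not specify one.

```python
# Minimal sketch, assuming a chat-tuned causal LM served through transformers.
# "org/Smaug-Qwen2-72B-Instruct" is a placeholder repository ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/Smaug-Qwen2-72B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # a 72B model needs multiple GPUs or offloading
)

messages = [{"role": "user", "content": "Sort the words: cherry apple banana"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```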