hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
| Task |Version| Metric | Value | |Stderr|
|--------------|------:|--------|------:|---|-----:|
|arc_easy | 0|acc | 0.4322|± |0.0102|
| | |acc_norm| 0.3868|± |0.0100|
|boolq | 1|acc | 0.6092|± |0.0085|
|lambada_openai| 0|ppl |74.2399|± |2.9038|
| | |acc | 0.2604|± |0.0061|
|openbookqa | 0|acc | 0.1440|± |0.0157|
| | |acc_norm| 0.2780|± |0.0201|
|piqa | 0|acc | 0.5909|± |0.0115|
| | |acc_norm| 0.5871|± |0.0115|
|winogrande | 0|acc | 0.5225|± |0.0140|
hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 25, batch_size: 64
| Task |Version| Metric |Value | |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge| 0|acc |0.1817|± |0.0113|
| | |acc_norm|0.2329|± |0.0124|
hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 10, batch_size: 64
| Task |Version| Metric |Value | |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag| 0|acc |0.2792|± |0.0045|
| | |acc_norm|0.2865|± |0.0045|
hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
| Task |Version|Metric|Value | |Stderr|
|-------------|------:|------|-----:|---|-----:|
|truthfulqa_mc| 1|mc1 |0.2485|± |0.0151|
| | |mc2 |0.4594|± |0.0151|
hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 5, batch_size: 64
| Task |Version| Metric |Value | |Stderr|
|-------------------------------------------------|------:|--------|-----:|---|-----:|
|hendrycksTest-abstract_algebra | 1|acc |0.2200|± |0.0416|
| | |acc_norm|0.2200|± |0.0416|
|hendrycksTest-anatomy | 1|acc |0.2741|± |0.0385|
| | |acc_norm|0.2741|± |0.0385|
|hendrycksTest-astronomy | 1|acc |0.1776|± |0.0311|
| | |acc_norm|0.1776|± |0.0311|
|hendrycksTest-business_ethics | 1|acc |0.2100|± |0.0409|
| | |acc_norm|0.2100|± |0.0409|
|hendrycksTest-clinical_knowledge | 1|acc |0.2264|± |0.0258|
| | |acc_norm|0.2264|± |0.0258|
|hendrycksTest-college_biology | 1|acc |0.2500|± |0.0362|
| | |acc_norm|0.2500|± |0.0362|
|hendrycksTest-college_chemistry | 1|acc |0.1500|± |0.0359|
| | |acc_norm|0.1500|± |0.0359|
|hendrycksTest-college_computer_science | 1|acc |0.1600|± |0.0368|
| | |acc_norm|0.1600|± |0.0368|
|hendrycksTest-college_mathematics | 1|acc |0.3000|± |0.0461|
| | |acc_norm|0.3000|± |0.0461|
|hendrycksTest-college_medicine | 1|acc |0.1908|± |0.0300|
| | |acc_norm|0.1908|± |0.0300|
|hendrycksTest-college_physics | 1|acc |0.2157|± |0.0409|
| | |acc_norm|0.2157|± |0.0409|
|hendrycksTest-computer_security | 1|acc |0.2200|± |0.0416|
| | |acc_norm|0.2200|± |0.0416|
|hendrycksTest-conceptual_physics | 1|acc |0.2383|± |0.0279|
| | |acc_norm|0.2383|± |0.0279|
|hendrycksTest-econometrics | 1|acc |0.2456|± |0.0405|
| | |acc_norm|0.2456|± |0.0405|
|hendrycksTest-electrical_engineering | 1|acc |0.2276|± |0.0349|
| | |acc_norm|0.2276|± |0.0349|
|hendrycksTest-elementary_mathematics | 1|acc |0.1772|± |0.0197|
| | |acc_norm|0.1772|± |0.0197|
|hendrycksTest-formal_logic | 1|acc |0.2460|± |0.0385|
| | |acc_norm|0.2460|± |0.0385|
|hendrycksTest-global_facts | 1|acc |0.2400|± |0.0429|
| | |acc_norm|0.2400|± |0.0429|
|hendrycksTest-high_school_biology | 1|acc |0.3065|± |0.0262|
| | |acc_norm|0.3065|± |0.0262|
|hendrycksTest-high_school_chemistry | 1|acc |0.2759|± |0.0314|
| | |acc_norm|0.2759|± |0.0314|
|hendrycksTest-high_school_computer_science | 1|acc |0.1600|± |0.0368|
| | |acc_norm|0.1600|± |0.0368|
|hendrycksTest-high_school_european_history | 1|acc |0.2242|± |0.0326|
| | |acc_norm|0.2242|± |0.0326|
|hendrycksTest-high_school_geography | 1|acc |0.2828|± |0.0321|
| | |acc_norm|0.2828|± |0.0321|
|hendrycksTest-high_school_government_and_politics| 1|acc |0.3472|± |0.0344|
| | |acc_norm|0.3472|± |0.0344|
|hendrycksTest-high_school_macroeconomics | 1|acc |0.3026|± |0.0233|
| | |acc_norm|0.3026|± |0.0233|
|hendrycksTest-high_school_mathematics | 1|acc |0.2667|± |0.0270|
| | |acc_norm|0.2667|± |0.0270|
|hendrycksTest-high_school_microeconomics | 1|acc |0.2983|± |0.0297|
| | |acc_norm|0.2983|± |0.0297|
|hendrycksTest-high_school_physics | 1|acc |0.1722|± |0.0308|
| | |acc_norm|0.1722|± |0.0308|
|hendrycksTest-high_school_psychology | 1|acc |0.2312|± |0.0181|
| | |acc_norm|0.2312|± |0.0181|
|hendrycksTest-high_school_statistics | 1|acc |0.4167|± |0.0336|
| | |acc_norm|0.4167|± |0.0336|
|hendrycksTest-high_school_us_history | 1|acc |0.2451|± |0.0302|
| | |acc_norm|0.2451|± |0.0302|
|hendrycksTest-high_school_world_history | 1|acc |0.2489|± |0.0281|
| | |acc_norm|0.2489|± |0.0281|
|hendrycksTest-human_aging | 1|acc |0.2422|± |0.0288|
| | |acc_norm|0.2422|± |0.0288|
|hendrycksTest-human_sexuality | 1|acc |0.2214|± |0.0364|
| | |acc_norm|0.2214|± |0.0364|
|hendrycksTest-international_law | 1|acc |0.3223|± |0.0427|
| | |acc_norm|0.3223|± |0.0427|
|hendrycksTest-jurisprudence | 1|acc |0.2500|± |0.0419|
| | |acc_norm|0.2500|± |0.0419|
|hendrycksTest-logical_fallacies | 1|acc |0.2454|± |0.0338|
| | |acc_norm|0.2454|± |0.0338|
|hendrycksTest-machine_learning | 1|acc |0.1964|± |0.0377|
| | |acc_norm|0.1964|± |0.0377|
|hendrycksTest-management | 1|acc |0.2427|± |0.0425|
| | |acc_norm|0.2427|± |0.0425|
|hendrycksTest-marketing | 1|acc |0.2009|± |0.0262|
| | |acc_norm|0.2009|± |0.0262|
|hendrycksTest-medical_genetics | 1|acc |0.2400|± |0.0429|
| | |acc_norm|0.2400|± |0.0429|
|hendrycksTest-miscellaneous | 1|acc |0.2593|± |0.0157|
| | |acc_norm|0.2593|± |0.0157|
|hendrycksTest-moral_disputes | 1|acc |0.2486|± |0.0233|
| | |acc_norm|0.2486|± |0.0233|
|hendrycksTest-moral_scenarios | 1|acc |0.2469|± |0.0144|
| | |acc_norm|0.2469|± |0.0144|
|hendrycksTest-nutrition | 1|acc |0.2157|± |0.0236|
| | |acc_norm|0.2157|± |0.0236|
|hendrycksTest-philosophy | 1|acc |0.2830|± |0.0256|
| | |acc_norm|0.2830|± |0.0256|
|hendrycksTest-prehistory | 1|acc |0.2377|± |0.0237|
| | |acc_norm|0.2377|± |0.0237|
|hendrycksTest-professional_accounting | 1|acc |0.2801|± |0.0268|
| | |acc_norm|0.2801|± |0.0268|
|hendrycksTest-professional_law | 1|acc |0.2458|± |0.0110|
| | |acc_norm|0.2458|± |0.0110|
|hendrycksTest-professional_medicine | 1|acc |0.2794|± |0.0273|
| | |acc_norm|0.2794|± |0.0273|
|hendrycksTest-professional_psychology | 1|acc |0.2598|± |0.0177|
| | |acc_norm|0.2598|± |0.0177|
|hendrycksTest-public_relations | 1|acc |0.2273|± |0.0401|
| | |acc_norm|0.2273|± |0.0401|
|hendrycksTest-security_studies | 1|acc |0.3388|± |0.0303|
| | |acc_norm|0.3388|± |0.0303|
|hendrycksTest-sociology | 1|acc |0.2189|± |0.0292|
| | |acc_norm|0.2189|± |0.0292|
|hendrycksTest-us_foreign_policy | 1|acc |0.2100|± |0.0409|
| | |acc_norm|0.2100|± |0.0409|
|hendrycksTest-virology | 1|acc |0.2169|± |0.0321|
| | |acc_norm|0.2169|± |0.0321|
|hendrycksTest-world_religions | 1|acc |0.2047|± |0.0309|
| | |acc_norm|0.2047|± |0.0309|
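Each block above is the raw output of an lm-evaluation-harness run with the settings given in its header line (model args, `num_fewshot`, `batch_size`). The sketch below shows how such a run could be reproduced through the harness's Python API. It is an assumption-laden example, not the exact command used here: it presumes the pre-0.4 lm-evaluation-harness release that provides the `hf-causal-experimental` model type and the `evaluator.simple_evaluate` / `make_table` helpers; only the model name and the settings in the headers come from this file.

```python
# Minimal reproduction sketch for the 0-shot run above,
# assuming lm-evaluation-harness ~0.3.x (hf-causal-experimental backend).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args=(
        "pretrained=BEE-spoke-data/smol_llama-101M-GQA,"
        "trust_remote_code=True,dtype=float"
    ),
    tasks=["arc_easy", "boolq", "lambada_openai",
           "openbookqa", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=64,
)

# Renders the same markdown table layout as shown above.
print(evaluator.make_table(results))
```

The remaining blocks correspond to the same call with the task list and `num_fewshot` swapped per their headers: `arc_challenge` at 25-shot, `hellaswag` at 10-shot, `truthfulqa_mc` at 0-shot, and the `hendrycksTest-*` (MMLU) tasks at 5-shot.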