hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
|     Task     |Version| Metric | Value |   |Stderr|
|--------------|------:|--------|------:|---|-----:|
|arc_easy      |      0|acc     | 0.4322|±  |0.0102|
|              |       |acc_norm| 0.3868|±  |0.0100|
|boolq         |      1|acc     | 0.6092|±  |0.0085|
|lambada_openai|      0|ppl     |74.2399|±  |2.9038|
|              |       |acc     | 0.2604|±  |0.0061|
|openbookqa    |      0|acc     | 0.1440|±  |0.0157|
|              |       |acc_norm| 0.2780|±  |0.0201|
|piqa          |      0|acc     | 0.5909|±  |0.0115|
|              |       |acc_norm| 0.5871|±  |0.0115|
|winogrande    |      0|acc     | 0.5225|±  |0.0140|

hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 25, batch_size: 64
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|arc_challenge|      0|acc     |0.1817|±  |0.0113|
|             |       |acc_norm|0.2329|±  |0.0124|

hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 10, batch_size: 64
|  Task   |Version| Metric |Value |   |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag|      0|acc     |0.2792|±  |0.0045|
|         |       |acc_norm|0.2865|±  |0.0045|

hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
|    Task     |Version|Metric|Value |   |Stderr|
|-------------|------:|------|-----:|---|-----:|
|truthfulqa_mc|      1|mc1   |0.2485|±  |0.0151|
|             |       |mc2   |0.4594|±  |0.0151|

hf-causal-experimental (pretrained=BEE-spoke-data/smol_llama-101M-GQA,trust_remote_code=True,dtype=float), limit: None, provide_description: False, num_fewshot: 5, batch_size: 64
|                      Task                       |Version| Metric |Value |   |Stderr|
|-------------------------------------------------|------:|--------|-----:|---|-----:|
|hendrycksTest-abstract_algebra                   |      1|acc     |0.2200|±  |0.0416|
|                                                 |       |acc_norm|0.2200|±  |0.0416|
|hendrycksTest-anatomy                            |      1|acc     |0.2741|±  |0.0385|
|                                                 |       |acc_norm|0.2741|±  |0.0385|
|hendrycksTest-astronomy                          |      1|acc     |0.1776|±  |0.0311|
|                                                 |       |acc_norm|0.1776|±  |0.0311|
|hendrycksTest-business_ethics                    |      1|acc     |0.2100|±  |0.0409|
|                                                 |       |acc_norm|0.2100|±  |0.0409|
|hendrycksTest-clinical_knowledge                 |      1|acc     |0.2264|±  |0.0258|
|                                                 |       |acc_norm|0.2264|±  |0.0258|
|hendrycksTest-college_biology                    |      1|acc     |0.2500|±  |0.0362|
|                                                 |       |acc_norm|0.2500|±  |0.0362|
|hendrycksTest-college_chemistry                  |      1|acc     |0.1500|±  |0.0359|
|                                                 |       |acc_norm|0.1500|±  |0.0359|
|hendrycksTest-college_computer_science           |      1|acc     |0.1600|±  |0.0368|
|                                                 |       |acc_norm|0.1600|±  |0.0368|
|hendrycksTest-college_mathematics                |      1|acc     |0.3000|±  |0.0461|
|                                                 |       |acc_norm|0.3000|±  |0.0461|
|hendrycksTest-college_medicine                   |      1|acc     |0.1908|±  |0.0300|
|                                                 |       |acc_norm|0.1908|±  |0.0300|
|hendrycksTest-college_physics                    |      1|acc     |0.2157|±  |0.0409|
|                                                 |       |acc_norm|0.2157|±  |0.0409|
|hendrycksTest-computer_security                  |      1|acc     |0.2200|±  |0.0416|
|                                                 |       |acc_norm|0.2200|±  |0.0416|
|hendrycksTest-conceptual_physics                 |      1|acc     |0.2383|±  |0.0279|
|                                                 |       |acc_norm|0.2383|±  |0.0279|
|hendrycksTest-econometrics                       |      1|acc     |0.2456|±  |0.0405|
|                                                 |       |acc_norm|0.2456|±  |0.0405|
|hendrycksTest-electrical_engineering             |      1|acc     |0.2276|±  |0.0349|
|                                                 |       |acc_norm|0.2276|±  |0.0349|
|hendrycksTest-elementary_mathematics             |      1|acc     |0.1772|±  |0.0197|
|                                                 |       |acc_norm|0.1772|±  |0.0197|
|hendrycksTest-formal_logic                       |      1|acc     |0.2460|±  |0.0385|
|                                                 |       |acc_norm|0.2460|±  |0.0385|
|hendrycksTest-global_facts                       |      1|acc     |0.2400|±  |0.0429|
|                                                 |       |acc_norm|0.2400|±  |0.0429|
|hendrycksTest-high_school_biology                |      1|acc     |0.3065|±  |0.0262|
|                                                 |       |acc_norm|0.3065|±  |0.0262|
|hendrycksTest-high_school_chemistry              |      1|acc     |0.2759|±  |0.0314|
|                                                 |       |acc_norm|0.2759|±  |0.0314|
|hendrycksTest-high_school_computer_science       |      1|acc     |0.1600|±  |0.0368|
|                                                 |       |acc_norm|0.1600|±  |0.0368|
|hendrycksTest-high_school_european_history       |      1|acc     |0.2242|±  |0.0326|
|                                                 |       |acc_norm|0.2242|±  |0.0326|
|hendrycksTest-high_school_geography              |      1|acc     |0.2828|±  |0.0321|
|                                                 |       |acc_norm|0.2828|±  |0.0321|
|hendrycksTest-high_school_government_and_politics|      1|acc     |0.3472|±  |0.0344|
|                                                 |       |acc_norm|0.3472|±  |0.0344|
|hendrycksTest-high_school_macroeconomics         |      1|acc     |0.3026|±  |0.0233|
|                                                 |       |acc_norm|0.3026|±  |0.0233|
|hendrycksTest-high_school_mathematics            |      1|acc     |0.2667|±  |0.0270|
|                                                 |       |acc_norm|0.2667|±  |0.0270|
|hendrycksTest-high_school_microeconomics         |      1|acc     |0.2983|±  |0.0297|
|                                                 |       |acc_norm|0.2983|±  |0.0297|
|hendrycksTest-high_school_physics                |      1|acc     |0.1722|±  |0.0308|
|                                                 |       |acc_norm|0.1722|±  |0.0308|
|hendrycksTest-high_school_psychology             |      1|acc     |0.2312|±  |0.0181|
|                                                 |       |acc_norm|0.2312|±  |0.0181|
|hendrycksTest-high_school_statistics             |      1|acc     |0.4167|±  |0.0336|
|                                                 |       |acc_norm|0.4167|±  |0.0336|
|hendrycksTest-high_school_us_history             |      1|acc     |0.2451|±  |0.0302|
|                                                 |       |acc_norm|0.2451|±  |0.0302|
|hendrycksTest-high_school_world_history          |      1|acc     |0.2489|±  |0.0281|
|                                                 |       |acc_norm|0.2489|±  |0.0281|
|hendrycksTest-human_aging                        |      1|acc     |0.2422|±  |0.0288|
|                                                 |       |acc_norm|0.2422|±  |0.0288|
|hendrycksTest-human_sexuality                    |      1|acc     |0.2214|±  |0.0364|
|                                                 |       |acc_norm|0.2214|±  |0.0364|
|hendrycksTest-international_law                  |      1|acc     |0.3223|±  |0.0427|
|                                                 |       |acc_norm|0.3223|±  |0.0427|
|hendrycksTest-jurisprudence                      |      1|acc     |0.2500|±  |0.0419|
|                                                 |       |acc_norm|0.2500|±  |0.0419|
|hendrycksTest-logical_fallacies                  |      1|acc     |0.2454|±  |0.0338|
|                                                 |       |acc_norm|0.2454|±  |0.0338|
|hendrycksTest-machine_learning                   |      1|acc     |0.1964|±  |0.0377|
|                                                 |       |acc_norm|0.1964|±  |0.0377|
|hendrycksTest-management                         |      1|acc     |0.2427|±  |0.0425|
|                                                 |       |acc_norm|0.2427|±  |0.0425|
|hendrycksTest-marketing                          |      1|acc     |0.2009|±  |0.0262|
|                                                 |       |acc_norm|0.2009|±  |0.0262|
|hendrycksTest-medical_genetics                   |      1|acc     |0.2400|±  |0.0429|
|                                                 |       |acc_norm|0.2400|±  |0.0429|
|hendrycksTest-miscellaneous                      |      1|acc     |0.2593|±  |0.0157|
|                                                 |       |acc_norm|0.2593|±  |0.0157|
|hendrycksTest-moral_disputes                     |      1|acc     |0.2486|±  |0.0233|
|                                                 |       |acc_norm|0.2486|±  |0.0233|
|hendrycksTest-moral_scenarios                    |      1|acc     |0.2469|±  |0.0144|
|                                                 |       |acc_norm|0.2469|±  |0.0144|
|hendrycksTest-nutrition                          |      1|acc     |0.2157|±  |0.0236|
|                                                 |       |acc_norm|0.2157|±  |0.0236|
|hendrycksTest-philosophy                         |      1|acc     |0.2830|±  |0.0256|
|                                                 |       |acc_norm|0.2830|±  |0.0256|
|hendrycksTest-prehistory                         |      1|acc     |0.2377|±  |0.0237|
|                                                 |       |acc_norm|0.2377|±  |0.0237|
|hendrycksTest-professional_accounting            |      1|acc     |0.2801|±  |0.0268|
|                                                 |       |acc_norm|0.2801|±  |0.0268|
|hendrycksTest-professional_law                   |      1|acc     |0.2458|±  |0.0110|
|                                                 |       |acc_norm|0.2458|±  |0.0110|
|hendrycksTest-professional_medicine              |      1|acc     |0.2794|±  |0.0273|
|                                                 |       |acc_norm|0.2794|±  |0.0273|
|hendrycksTest-professional_psychology            |      1|acc     |0.2598|±  |0.0177|
|                                                 |       |acc_norm|0.2598|±  |0.0177|
|hendrycksTest-public_relations                   |      1|acc     |0.2273|±  |0.0401|
|                                                 |       |acc_norm|0.2273|±  |0.0401|
|hendrycksTest-security_studies                   |      1|acc     |0.3388|±  |0.0303|
|                                                 |       |acc_norm|0.3388|±  |0.0303|
|hendrycksTest-sociology                          |      1|acc     |0.2189|±  |0.0292|
|                                                 |       |acc_norm|0.2189|±  |0.0292|
|hendrycksTest-us_foreign_policy                  |      1|acc     |0.2100|±  |0.0409|
|                                                 |       |acc_norm|0.2100|±  |0.0409|
|hendrycksTest-virology                           |      1|acc     |0.2169|±  |0.0321|
|                                                 |       |acc_norm|0.2169|±  |0.0321|
|hendrycksTest-world_religions                    |      1|acc     |0.2047|±  |0.0309|
|                                                 |       |acc_norm|0.2047|±  |0.0309|
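
The per-subject `hendrycksTest-*` rows above are not aggregated anywhere in this file. As a minimal sketch, assuming the tables keep the six-column layout shown here, the rows could be parsed and averaged as below. `parse_rows` and `macro_average` are hypothetical helper names introduced for illustration, not part of the evaluation harness, and the unweighted (macro) mean is an assumption about how one might summarize the subjects, not a figure reported by the source. `SAMPLE` copies two subjects from the table above.

```python
from statistics import mean

# Two subjects copied from the 5-shot table above, used as sample input.
SAMPLE = """
|hendrycksTest-abstract_algebra                   |      1|acc     |0.2200|±  |0.0416|
|                                                 |       |acc_norm|0.2200|±  |0.0416|
|hendrycksTest-anatomy                            |      1|acc     |0.2741|±  |0.0385|
|                                                 |       |acc_norm|0.2741|±  |0.0385|
"""


def parse_rows(text):
    """Return (task, metric, value, stderr) tuples from harness-style table rows."""
    records, task = [], None
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 6:
            continue  # config lines and blank lines have no table cells
        try:
            value, stderr = float(cells[3]), float(cells[5])
        except ValueError:
            continue  # header or separator row
        # A blank task cell means the row belongs to the previous task.
        task = cells[0] or task
        records.append((task, cells[2], value, stderr))
    return records


def macro_average(records, prefix="hendrycksTest-", metric="acc"):
    """Unweighted mean of one metric over all tasks whose name starts with prefix."""
    values = [v for t, m, v, _ in records if t and t.startswith(prefix) and m == metric]
    return mean(values)


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    print(len(rows), "rows parsed")
    print("macro-average acc:", macro_average(rows))
```

Passing the full contents of this file to `parse_rows` instead of `SAMPLE` would cover every subject. A macro average weights each subject equally regardless of how many questions it contains, so it can differ from a per-question (micro) average.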