Commit afce414 (verified) by kmchiti · 1 parent: 346d574

Upload results/eval_difficulty/checkpoint-18779_metrics.json

results/eval_difficulty/checkpoint-18779_metrics.json ADDED
@@ -0,0 +1,1979 @@
1
+ {
2
+ "checkpoint": "checkpoint-18779",
3
+ "id": {
4
+ "answer_accuracy": 0.9630347222222222,
5
+ "avg_response_len": 201.21272916666666,
6
+ "avg_loss": 6.54602900314331,
7
+ "count": 1152000,
8
+ "validation_metrics": {},
9
+ "pass_at_k": {
10
+ "pass@1": 0.9630347222222222,
11
+ "pass@2": 0.9784445264654293,
12
+ "pass@4": 0.9859757670187062,
13
+ "pass@8": 0.9905464808595384,
14
+ "pass@16": 0.9936741909013082,
15
+ "pass@32": 0.9958726768097093,
16
+ "pass@64": 0.9973773311067934,
17
+ "pass@128": 0.9983333333333333
18
+ },
19
+ "per_op_pass_at_k": {
20
+ "10": {
21
+ "pass@1": 0.89534375,
22
+ "pass@2": 0.9316516978346466,
23
+ "pass@4": 0.9502444778777662,
24
+ "pass@8": 0.9628981587488723,
25
+ "pass@16": 0.9732652153333301,
26
+ "pass@32": 0.9817272204905513,
27
+ "pass@64": 0.9875005300252477,
28
+ "pass@128": 0.991
29
+ },
30
+ "2": {
31
+ "pass@1": 0.9994453125,
32
+ "pass@2": 0.9998784448818898,
33
+ "pass@4": 0.999987270247469,
34
+ "pass@8": 0.9999998760353234,
35
+ "pass@16": 0.9999999999955357,
36
+ "pass@32": 1.0,
37
+ "pass@64": 1.0,
38
+ "pass@128": 1.0
39
+ },
40
+ "3": {
41
+ "pass@1": 0.9896953125,
42
+ "pass@2": 0.9956350885826775,
43
+ "pass@4": 0.998029471128609,
44
+ "pass@8": 0.9990296020970205,
45
+ "pass@16": 0.9994722941030695,
46
+ "pass@32": 0.999768854949753,
47
+ "pass@64": 0.9999711811022791,
48
+ "pass@128": 1.0
49
+ },
50
+ "4": {
51
+ "pass@1": 0.9968125,
52
+ "pass@2": 0.9986126968503938,
53
+ "pass@4": 0.9994299414135731,
54
+ "pass@8": 0.9998706652443888,
55
+ "pass@16": 0.999993190320664,
56
+ "pass@32": 0.9999999900668188,
57
+ "pass@64": 0.9999999999999999,
58
+ "pass@128": 1.0
59
+ },
60
+ "5": {
61
+ "pass@1": 0.95453125,
62
+ "pass@2": 0.9673741387795279,
63
+ "pass@4": 0.9768552553430826,
64
+ "pass@8": 0.9845504029004547,
65
+ "pass@16": 0.9905860749032567,
66
+ "pass@32": 0.9947039335217139,
67
+ "pass@64": 0.9972231304095797,
68
+ "pass@128": 0.999
69
+ },
70
+ "6": {
71
+ "pass@1": 0.9755546875,
72
+ "pass@2": 0.9871010088582685,
73
+ "pass@4": 0.9931382349081366,
74
+ "pass@8": 0.9961125829051054,
75
+ "pass@16": 0.9972334916240602,
76
+ "pass@32": 0.9976868579347262,
77
+ "pass@64": 0.998251968426333,
78
+ "pass@128": 0.999
79
+ },
80
+ "7": {
81
+ "pass@1": 0.9665546875,
82
+ "pass@2": 0.980099409448819,
83
+ "pass@4": 0.9868899991563556,
84
+ "pass@8": 0.9912036930431989,
85
+ "pass@16": 0.9940644245074017,
86
+ "pass@32": 0.9960585379187011,
87
+ "pass@64": 0.9974804599031311,
88
+ "pass@128": 0.998
89
+ },
90
+ "8": {
91
+ "pass@1": 0.95003125,
92
+ "pass@2": 0.9737233021653545,
93
+ "pass@4": 0.984498448537683,
94
+ "pass@8": 0.9902565507718185,
95
+ "pass@16": 0.9939245919129133,
96
+ "pass@32": 0.996586911336722,
97
+ "pass@64": 0.998538493317924,
98
+ "pass@128": 1.0
99
+ },
100
+ "9": {
101
+ "pass@1": 0.93934375,
102
+ "pass@2": 0.9719249507874034,
103
+ "pass@4": 0.98470880455568,
104
+ "pass@8": 0.990996795989665,
105
+ "pass@16": 0.9945284354115372,
106
+ "pass@32": 0.9963217850684021,
107
+ "pass@64": 0.9974302167766392,
108
+ "pass@128": 0.998
109
+ }
110
+ },
111
+ "per_op_accuracy": {
112
+ "10": 0.89534375,
113
+ "2": 0.9994453125,
114
+ "3": 0.9896953125,
115
+ "4": 0.9968125,
116
+ "5": 0.95453125,
117
+ "6": 0.9755546875,
118
+ "7": 0.9665546875,
119
+ "8": 0.95003125,
120
+ "9": 0.93934375
121
+ },
122
+ "per_op_avg_response_len": {
123
+ "10": 279.685734375,
124
+ "2": 116.979234375,
125
+ "3": 147.114546875,
126
+ "4": 160.9594765625,
127
+ "5": 176.0514453125,
128
+ "6": 200.8557890625,
129
+ "7": 225.0676796875,
130
+ "8": 244.2956640625,
131
+ "9": 259.9049921875
132
+ },
133
+ "per_op_avg_loss": {
134
+ "10": 0.13990753173828124,
135
+ "2": 0.15822406005859374,
136
+ "3": 0.15319696044921874,
137
+ "4": 0.1460938720703125,
138
+ "5": 0.1487813720703125,
139
+ "6": 0.14940606689453126,
140
+ "7": 0.14932769775390625,
141
+ "8": 0.14807330322265624,
142
+ "9": 0.1509442138671875
143
+ },
144
+ "per_op_length": {
145
+ "10": 279.685734375,
146
+ "2": 116.979234375,
147
+ "3": 147.114546875,
148
+ "4": 160.9594765625,
149
+ "5": 176.0514453125,
150
+ "6": 200.8557890625,
151
+ "7": 225.0676796875,
152
+ "8": 244.2956640625,
153
+ "9": 259.9049921875
154
+ },
155
+ "per_op_loss": {
156
+ "10": 0.13990753173828124,
157
+ "2": 0.15822406005859374,
158
+ "3": 0.15319696044921874,
159
+ "4": 0.1460938720703125,
160
+ "5": 0.1487813720703125,
161
+ "6": 0.14940606689453126,
162
+ "7": 0.14932769775390625,
163
+ "8": 0.14807330322265624,
164
+ "9": 0.1509442138671875
165
+ },
166
+ "per_template": {
167
+ "crazy_zootopia": {
168
+ "count": 379008,
169
+ "correct": 364061,
170
+ "answer_accuracy": 0.960562837723742,
171
+ "avg_response_len": 191.88788099459643,
172
+ "resp_tokens_sum": 72727042,
173
+ "pass_at_k": {
174
+ "pass@1": 0.960562837723742,
175
+ "pass@2": 0.9772055171959906,
176
+ "pass@4": 0.985136322815272,
177
+ "pass@8": 0.9899132907085476,
178
+ "pass@16": 0.9932574318987277,
179
+ "pass@32": 0.9956698086202027,
180
+ "pass@64": 0.9972163254747021,
181
+ "pass@128": 0.9979736575481256
182
+ },
183
+ "per_op_pass_at_k": {
184
+ "10": {
185
+ "pass@1": 0.8994976032448377,
186
+ "pass@2": 0.9366747090795067,
187
+ "pass@4": 0.954640098361598,
188
+ "pass@8": 0.9671935395338649,
189
+ "pass@16": 0.9780409476989385,
190
+ "pass@32": 0.9873672705205916,
191
+ "pass@64": 0.9931124421272916,
192
+ "pass@128": 0.9941002949852508
193
+ },
194
+ "2": {
195
+ "pass@1": 0.9997322819314641,
196
+ "pass@2": 0.9999915679348491,
197
+ "pass@4": 0.9999999897793149,
198
+ "pass@8": 1.0,
199
+ "pass@16": 1.0,
200
+ "pass@32": 1.0,
201
+ "pass@64": 1.0,
202
+ "pass@128": 1.0
203
+ },
204
+ "3": {
205
+ "pass@1": 0.9875525611620795,
206
+ "pass@2": 0.9937483445303281,
207
+ "pass@4": 0.9960841566249173,
208
+ "pass@8": 0.9973945614382809,
209
+ "pass@16": 0.9984052318072489,
210
+ "pass@32": 0.9992931754002767,
211
+ "pass@64": 0.9999118688143109,
212
+ "pass@128": 1.0
213
+ },
214
+ "4": {
215
+ "pass@1": 0.996337890625,
216
+ "pass@2": 0.9982602577509845,
217
+ "pass@4": 0.9992694307742784,
218
+ "pass@8": 0.9998455780548049,
219
+ "pass@16": 0.9999940908757206,
220
+ "pass@32": 0.9999999967265157,
221
+ "pass@64": 1.0,
222
+ "pass@128": 1.0
223
+ },
224
+ "5": {
225
+ "pass@1": 0.9481150793650793,
226
+ "pass@2": 0.9641533089613797,
227
+ "pass@4": 0.9757381440415184,
228
+ "pass@8": 0.9837977641761237,
229
+ "pass@16": 0.9891567790393069,
230
+ "pass@32": 0.9924331098404072,
231
+ "pass@64": 0.9948057827917357,
232
+ "pass@128": 0.9968253968253968
233
+ },
234
+ "6": {
235
+ "pass@1": 0.9665746631736527,
236
+ "pass@2": 0.9810815131547932,
237
+ "pass@4": 0.9889092237347574,
238
+ "pass@8": 0.992817054998713,
239
+ "pass@16": 0.9942118482404367,
240
+ "pass@32": 0.9947566055669427,
241
+ "pass@64": 0.9955089818359227,
242
+ "pass@128": 0.9970059880239521
243
+ },
244
+ "7": {
245
+ "pass@1": 0.9575397559171598,
246
+ "pass@2": 0.9706686361412663,
247
+ "pass@4": 0.9781720887884576,
248
+ "pass@8": 0.9836457175352227,
249
+ "pass@16": 0.9877211930993773,
250
+ "pass@32": 0.991030454111008,
251
+ "pass@64": 0.9934560055493304,
252
+ "pass@128": 0.9940828402366864
253
+ },
254
+ "8": {
255
+ "pass@1": 0.9534755608974359,
256
+ "pass@2": 0.9758759684786995,
257
+ "pass@4": 0.9847553208493165,
258
+ "pass@8": 0.9897007291187125,
259
+ "pass@16": 0.9935694000922215,
260
+ "pass@32": 0.9965151373391712,
261
+ "pass@64": 0.998300039303945,
262
+ "pass@128": 1.0
263
+ },
264
+ "9": {
265
+ "pass@1": 0.9408450704225352,
266
+ "pass@2": 0.9764160751913055,
267
+ "pass@4": 0.9897490211610868,
268
+ "pass@8": 0.9954928597779902,
269
+ "pass@16": 0.9986146363237696,
270
+ "pass@32": 0.9998472651017836,
271
+ "pass@64": 0.9999990904114461,
272
+ "pass@128": 1.0
273
+ }
274
+ }
275
+ },
276
+ "teachers_in_school": {
277
+ "count": 390656,
278
+ "correct": 376417,
279
+ "answer_accuracy": 0.9635510525884666,
280
+ "avg_response_len": 199.2523191759502,
281
+ "resp_tokens_sum": 77839114,
282
+ "pass_at_k": {
283
+ "pass@1": 0.9635510525884666,
284
+ "pass@2": 0.978026663489027,
285
+ "pass@4": 0.9851850014079184,
286
+ "pass@8": 0.9897531921524919,
287
+ "pass@16": 0.9930249781079535,
288
+ "pass@32": 0.9953635939722416,
289
+ "pass@64": 0.9969830500068908,
290
+ "pass@128": 0.9980340760157274
291
+ },
292
+ "per_op_pass_at_k": {
293
+ "10": {
294
+ "pass@1": 0.8878930214723927,
295
+ "pass@2": 0.9240016333751996,
296
+ "pass@4": 0.9427751138086263,
297
+ "pass@8": 0.9559837013558444,
298
+ "pass@16": 0.9669589357176731,
299
+ "pass@32": 0.9757400698048759,
300
+ "pass@64": 0.9820426961144421,
301
+ "pass@128": 0.9877300613496932
302
+ },
303
+ "2": {
304
+ "pass@1": 0.9987177051671733,
305
+ "pass@2": 0.9996395064978579,
306
+ "pass@4": 0.9999613177152248,
307
+ "pass@8": 0.9999996232076699,
308
+ "pass@16": 0.9999999999864305,
309
+ "pass@32": 1.0,
310
+ "pass@64": 1.0,
311
+ "pass@128": 1.0
312
+ },
313
+ "3": {
314
+ "pass@1": 0.9907670454545454,
315
+ "pass@2": 0.9960163892865667,
316
+ "pass@4": 0.9985778993534897,
317
+ "pass@8": 0.9996915957072215,
318
+ "pass@16": 0.9999815253902985,
319
+ "pass@32": 0.99999995938087,
320
+ "pass@64": 0.9999999999999987,
321
+ "pass@128": 1.0
322
+ },
323
+ "4": {
324
+ "pass@1": 0.9982664571005917,
325
+ "pass@2": 0.9998132687648512,
326
+ "pass@4": 0.999995557138198,
327
+ "pass@8": 0.9999999967656203,
328
+ "pass@16": 0.9999999999999998,
329
+ "pass@32": 1.0,
330
+ "pass@64": 1.0,
331
+ "pass@128": 1.0
332
+ },
333
+ "5": {
334
+ "pass@1": 0.9600317028985508,
335
+ "pass@2": 0.9691975493552438,
336
+ "pass@4": 0.9763000869456536,
337
+ "pass@8": 0.9828931121195829,
338
+ "pass@16": 0.989036439654314,
339
+ "pass@32": 0.9944838684919953,
340
+ "pass@64": 0.9981653755234783,
341
+ "pass@128": 1.0
342
+ },
343
+ "6": {
344
+ "pass@1": 0.9823379297994269,
345
+ "pass@2": 0.9910194058389551,
346
+ "pass@4": 0.9946701218221643,
347
+ "pass@8": 0.9965445299840145,
348
+ "pass@16": 0.9976712739922713,
349
+ "pass@32": 0.9983904225804993,
350
+ "pass@64": 0.9992893080034889,
351
+ "pass@128": 1.0
352
+ },
353
+ "7": {
354
+ "pass@1": 0.9681855130057804,
355
+ "pass@2": 0.9832421146056164,
356
+ "pass@4": 0.9914642009163593,
357
+ "pass@8": 0.9964758591421767,
358
+ "pass@16": 0.9990374180123244,
359
+ "pass@32": 0.9999126056145744,
360
+ "pass@64": 0.9999997855847184,
361
+ "pass@128": 1.0
362
+ },
363
+ "8": {
364
+ "pass@1": 0.9455765845070423,
365
+ "pass@2": 0.9712158284351782,
366
+ "pass@4": 0.9835400283591312,
367
+ "pass@8": 0.9910390348371692,
368
+ "pass@16": 0.9955726997113149,
369
+ "pass@32": 0.9980539040490879,
370
+ "pass@64": 0.9995741614459047,
371
+ "pass@128": 1.0
372
+ },
373
+ "9": {
374
+ "pass@1": 0.9390437874251497,
375
+ "pass@2": 0.9670264539818006,
376
+ "pass@4": 0.9783135702722786,
377
+ "pass@8": 0.9840906736702117,
378
+ "pass@16": 0.9879928605789021,
379
+ "pass@32": 0.9908513468886352,
380
+ "pass@64": 0.9930496212972972,
381
+ "pass@128": 0.9940119760479041
382
+ }
383
+ }
384
+ },
385
+ "movie_festival_awards": {
386
+ "count": 382336,
387
+ "correct": 368938,
388
+ "answer_accuracy": 0.9649575242718447,
389
+ "avg_response_len": 212.4594806662203,
390
+ "resp_tokens_sum": 81230908,
391
+ "pass_at_k": {
392
+ "pass@1": 0.9649575242718447,
393
+ "pass@2": 0.9800997069980434,
394
+ "pass@4": 0.9876158778089633,
395
+ "pass@8": 0.9919847109134324,
396
+ "pass@16": 0.9947506625624947,
397
+ "pass@32": 0.9965939401271129,
398
+ "pass@64": 0.9979397963205583,
399
+ "pass@128": 0.9989956478071644
400
+ },
401
+ "per_op_pass_at_k": {
402
+ "10": {
403
+ "pass@1": 0.8983908582089553,
404
+ "pass@2": 0.9340132506757551,
405
+ "pass@4": 0.9530650669599133,
406
+ "pass@8": 0.9652801856862316,
407
+ "pass@16": 0.9745693164759084,
408
+ "pass@32": 0.9818461284408381,
409
+ "pass@64": 0.9871328155545905,
410
+ "pass@128": 0.991044776119403
411
+ },
412
+ "2": {
413
+ "pass@1": 0.9998660714285714,
414
+ "pass@2": 0.9999992969628797,
415
+ "pass@4": 1.0,
416
+ "pass@8": 1.0,
417
+ "pass@16": 1.0,
418
+ "pass@32": 1.0,
419
+ "pass@64": 1.0,
420
+ "pass@128": 1.0
421
+ },
422
+ "3": {
423
+ "pass@1": 0.9907069970845481,
424
+ "pass@2": 0.9970669721769471,
425
+ "pass@4": 0.9993563997831172,
426
+ "pass@8": 0.9999514691641971,
427
+ "pass@16": 0.9999996499221584,
428
+ "pass@32": 0.9999999999946803,
429
+ "pass@64": 1.0,
430
+ "pass@128": 1.0
431
+ },
432
+ "4": {
433
+ "pass@1": 0.9958196271929824,
434
+ "pass@2": 0.9977559342911083,
435
+ "pass@4": 0.9990211264710332,
436
+ "pass@8": 0.9997663197662907,
437
+ "pass@16": 0.9999856176620865,
438
+ "pass@32": 0.9999999740185194,
439
+ "pass@64": 0.9999999999999997,
440
+ "pass@128": 1.0
441
+ },
442
+ "5": {
443
+ "pass@1": 0.9548943014705882,
444
+ "pass@2": 0.9685079174386292,
445
+ "pass@4": 0.9784535587463331,
446
+ "pass@8": 0.9869293632462349,
447
+ "pass@16": 0.9934826994856965,
448
+ "pass@32": 0.9970310861536675,
449
+ "pass@64": 0.9985066302193607,
450
+ "pass@128": 1.0
451
+ },
452
+ "6": {
453
+ "pass@1": 0.9775483044164038,
454
+ "pass@2": 0.9891293716932861,
455
+ "pass@4": 0.9959075131381444,
456
+ "pass@8": 0.9991092920224423,
457
+ "pass@16": 0.9999352040645176,
458
+ "pass@32": 0.9999996662295679,
459
+ "pass@64": 0.9999999999997393,
460
+ "pass@128": 1.0
461
+ },
462
+ "7": {
463
+ "pass@1": 0.9744115901898734,
464
+ "pass@2": 0.9867456923402768,
465
+ "pass@4": 0.9912063912303684,
466
+ "pass@8": 0.9935151685224703,
467
+ "pass@16": 0.9954041601504686,
468
+ "pass@32": 0.9972167179953726,
469
+ "pass@64": 0.9990265956175474,
470
+ "pass@128": 1.0
471
+ },
472
+ "8": {
473
+ "pass@1": 0.9515531156156156,
474
+ "pass@2": 0.9743795222387744,
475
+ "pass@4": 0.9852795146102226,
476
+ "pass@8": 0.9899431408996546,
477
+ "pass@16": 0.9925003924523234,
478
+ "pass@32": 0.9950902478963195,
479
+ "pass@64": 0.9976578190444356,
480
+ "pass@128": 1.0
481
+ },
482
+ "9": {
483
+ "pass@1": 0.9379521704180064,
484
+ "pass@2": 0.9720591912043948,
485
+ "pass@4": 0.9858236963747689,
486
+ "pass@8": 0.9932814976290268,
487
+ "pass@16": 0.9968830356374474,
488
+ "pass@32": 0.9981725276413653,
489
+ "pass@64": 0.9992024635603814,
490
+ "pass@128": 1.0
491
+ }
492
+ }
493
+ }
494
+ }
495
+ },
496
+ "ood": {
497
+ "answer_accuracy": 0.25220703125,
498
+ "avg_response_len": 284.85050390625,
499
+ "avg_loss": 5.825457085227966,
500
+ "count": 1280000,
501
+ "validation_metrics": {},
502
+ "pass_at_k": {
503
+ "pass@1": 0.25220703125,
504
+ "pass@2": 0.3026185285433113,
505
+ "pass@4": 0.34743800927071905,
506
+ "pass@8": 0.38810959550625124,
507
+ "pass@16": 0.42553433372274274,
508
+ "pass@32": 0.4601746364651569,
509
+ "pass@64": 0.49276390746466214,
510
+ "pass@128": 0.5238
511
+ },
512
+ "per_op_pass_at_k": {
513
+ "11": {
514
+ "pass@1": 0.7491953125,
515
+ "pass@2": 0.8267459399606308,
516
+ "pass@4": 0.8755285344956877,
517
+ "pass@8": 0.907201192042703,
518
+ "pass@16": 0.9291708873567505,
519
+ "pass@32": 0.9454073394478061,
520
+ "pass@64": 0.9574549244959281,
521
+ "pass@128": 0.967
522
+ },
523
+ "12": {
524
+ "pass@1": 0.42375,
525
+ "pass@2": 0.5180041830708659,
526
+ "pass@4": 0.593009113235845,
527
+ "pass@8": 0.6523426553791513,
528
+ "pass@16": 0.702357863892847,
529
+ "pass@32": 0.745694824334505,
530
+ "pass@64": 0.7831121214310477,
531
+ "pass@128": 0.816
532
+ },
533
+ "13": {
534
+ "pass@1": 0.2113984375,
535
+ "pass@2": 0.25608120078740143,
536
+ "pass@4": 0.3006886397637793,
537
+ "pass@8": 0.3447306794463938,
538
+ "pass@16": 0.38799260684503434,
539
+ "pass@32": 0.43170338099033495,
540
+ "pass@64": 0.4761056180149899,
541
+ "pass@128": 0.519
542
+ },
543
+ "14": {
544
+ "pass@1": 0.200171875,
545
+ "pass@2": 0.2479451279527559,
546
+ "pass@4": 0.29500759055118103,
547
+ "pass@8": 0.34113887630208256,
548
+ "pass@16": 0.38481813556379674,
549
+ "pass@32": 0.4260044402977936,
550
+ "pass@64": 0.4671482956865662,
551
+ "pass@128": 0.51
552
+ },
553
+ "15": {
554
+ "pass@1": 0.17846875,
555
+ "pass@2": 0.23154035433070885,
556
+ "pass@4": 0.28250203355830483,
557
+ "pass@8": 0.3294688278958133,
558
+ "pass@16": 0.3735511399431241,
559
+ "pass@32": 0.4152404384047447,
560
+ "pass@64": 0.4535175002619446,
561
+ "pass@128": 0.485
562
+ },
563
+ "16": {
564
+ "pass@1": 0.158390625,
565
+ "pass@2": 0.1908997293307087,
566
+ "pass@4": 0.22120489895013098,
567
+ "pass@8": 0.2515841984680904,
568
+ "pass@16": 0.2832809698127581,
569
+ "pass@32": 0.3157377911008275,
570
+ "pass@64": 0.3483624231356763,
571
+ "pass@128": 0.383
572
+ },
573
+ "17": {
574
+ "pass@1": 0.15778125,
575
+ "pass@2": 0.19822810039370062,
576
+ "pass@4": 0.23773755933633278,
577
+ "pass@8": 0.2756691021575299,
578
+ "pass@16": 0.31012296220189073,
579
+ "pass@32": 0.34113294654797527,
580
+ "pass@64": 0.370247222269346,
581
+ "pass@128": 0.399
582
+ },
583
+ "18": {
584
+ "pass@1": 0.1607265625,
585
+ "pass@2": 0.1996240157480314,
586
+ "pass@4": 0.23766353665166848,
587
+ "pass@8": 0.27507361063539393,
588
+ "pass@16": 0.31047573197015055,
589
+ "pass@32": 0.342168030163979,
590
+ "pass@64": 0.37123232750738133,
591
+ "pass@128": 0.4
592
+ },
593
+ "19": {
594
+ "pass@1": 0.142859375,
595
+ "pass@2": 0.17847379429133842,
596
+ "pass@4": 0.21176526481064853,
597
+ "pass@8": 0.24360403095909056,
598
+ "pass@16": 0.27449561265010364,
599
+ "pass@32": 0.305123051282576,
600
+ "pass@64": 0.3363755503535671,
601
+ "pass@128": 0.369
602
+ },
603
+ "20": {
604
+ "pass@1": 0.139328125,
605
+ "pass@2": 0.17864283956692892,
606
+ "pass@4": 0.2192729213535806,
607
+ "pass@8": 0.26028278177620573,
608
+ "pass@16": 0.2990774269909148,
609
+ "pass@32": 0.33353412208103916,
610
+ "pass@64": 0.3640830914901839,
611
+ "pass@128": 0.39
612
+ }
613
+ },
614
+ "per_op_accuracy": {
615
+ "11": 0.7491953125,
616
+ "12": 0.42375,
617
+ "13": 0.2113984375,
618
+ "14": 0.200171875,
619
+ "15": 0.17846875,
620
+ "16": 0.158390625,
621
+ "17": 0.15778125,
622
+ "18": 0.1607265625,
623
+ "19": 0.142859375,
624
+ "20": 0.139328125
625
+ },
626
+ "per_op_avg_response_len": {
627
+ "11": 291.4343515625,
628
+ "12": 288.3101484375,
629
+ "13": 282.0749140625,
630
+ "14": 284.6130234375,
631
+ "15": 282.3583984375,
632
+ "16": 284.9855,
633
+ "17": 285.6167265625,
634
+ "18": 284.2999609375,
635
+ "19": 281.3561640625,
636
+ "20": 283.4558515625
637
+ },
638
+ "per_op_avg_loss": {
639
+ "11": 0.1496517333984375,
640
+ "12": 0.160320068359375,
641
+ "13": 0.18829949951171876,
642
+ "14": 0.2179671630859375,
643
+ "15": 0.24814111328125,
644
+ "16": 0.280107177734375,
645
+ "17": 0.3107010498046875,
646
+ "18": 0.3399879150390625,
647
+ "19": 0.40024169921875,
648
+ "20": 0.419079345703125
649
+ },
650
+ "per_op_length": {
651
+ "11": 291.4343515625,
652
+ "12": 288.3101484375,
653
+ "13": 282.0749140625,
654
+ "14": 284.6130234375,
655
+ "15": 282.3583984375,
656
+ "16": 284.9855,
657
+ "17": 285.6167265625,
658
+ "18": 284.2999609375,
659
+ "19": 281.3561640625,
660
+ "20": 283.4558515625
661
+ },
662
+ "per_op_loss": {
663
+ "11": 0.1496517333984375,
664
+ "12": 0.160320068359375,
665
+ "13": 0.18829949951171876,
666
+ "14": 0.2179671630859375,
667
+ "15": 0.24814111328125,
668
+ "16": 0.280107177734375,
669
+ "17": 0.3107010498046875,
670
+ "18": 0.3399879150390625,
671
+ "19": 0.40024169921875,
672
+ "20": 0.419079345703125
673
+ },
674
+ "per_template": {
675
+ "crazy_zootopia": {
676
+ "count": 430720,
677
+ "correct": 113067,
678
+ "answer_accuracy": 0.2625069650817236,
679
+ "avg_response_len": 273.8302725668648,
680
+ "resp_tokens_sum": 117944175,
681
+ "pass_at_k": {
682
+ "pass@1": 0.2625069650817236,
683
+ "pass@2": 0.31366907342841527,
684
+ "pass@4": 0.3583084888720263,
685
+ "pass@8": 0.3983932569435031,
686
+ "pass@16": 0.4355776055864389,
687
+ "pass@32": 0.47017903805797134,
688
+ "pass@64": 0.5024705719566711,
689
+ "pass@128": 0.5337295690936107
690
+ },
691
+ "per_op_pass_at_k": {
692
+ "11": {
693
+ "pass@1": 0.7521689093484419,
694
+ "pass@2": 0.8267172826838576,
695
+ "pass@4": 0.8749963240253604,
696
+ "pass@8": 0.9072539747181511,
697
+ "pass@16": 0.9305891455160197,
698
+ "pass@32": 0.9482171914771189,
699
+ "pass@64": 0.9598296757598728,
700
+ "pass@128": 0.9660056657223796
701
+ },
702
+ "12": {
703
+ "pass@1": 0.4268626412429379,
704
+ "pass@2": 0.5226548250589441,
705
+ "pass@4": 0.5947451512946471,
706
+ "pass@8": 0.6499540700619184,
707
+ "pass@16": 0.697671407935311,
708
+ "pass@32": 0.7391764349175001,
709
+ "pass@64": 0.7728700105439386,
710
+ "pass@128": 0.8022598870056498
711
+ },
712
+ "13": {
713
+ "pass@1": 0.226048197492163,
714
+ "pass@2": 0.2745025516500875,
715
+ "pass@4": 0.32170087379359735,
716
+ "pass@8": 0.36726090847720233,
717
+ "pass@16": 0.4118837384913241,
718
+ "pass@32": 0.4577964268716136,
719
+ "pass@64": 0.5060004900301981,
720
+ "pass@128": 0.554858934169279
721
+ },
722
+ "14": {
723
+ "pass@1": 0.1970404984423676,
724
+ "pass@2": 0.2447659871955257,
725
+ "pass@4": 0.2905214605183697,
726
+ "pass@8": 0.33432116638118264,
727
+ "pass@16": 0.37689919457954874,
728
+ "pass@32": 0.41849874775311363,
729
+ "pass@64": 0.46201871752633594,
730
+ "pass@128": 0.5109034267912772
731
+ },
732
+ "15": {
733
+ "pass@1": 0.190774024566474,
734
+ "pass@2": 0.24494681910245306,
735
+ "pass@4": 0.29680882700356104,
736
+ "pass@8": 0.3440605324830605,
737
+ "pass@16": 0.38846742937996726,
738
+ "pass@32": 0.43008799604881653,
739
+ "pass@64": 0.4654618905817254,
740
+ "pass@128": 0.4913294797687861
741
+ },
742
+ "16": {
743
+ "pass@1": 0.1630796370967742,
744
+ "pass@2": 0.1964297053594107,
745
+ "pass@4": 0.23033826426938572,
746
+ "pass@8": 0.2660781043146322,
747
+ "pass@16": 0.3019958633560044,
748
+ "pass@32": 0.33569879011745574,
749
+ "pass@64": 0.3686862876847773,
750
+ "pass@128": 0.4064516129032258
751
+ },
752
+ "17": {
753
+ "pass@1": 0.1676300578034682,
754
+ "pass@2": 0.20699551397296437,
755
+ "pass@4": 0.2408987041359715,
756
+ "pass@8": 0.271017564290959,
757
+ "pass@16": 0.29876652985892266,
758
+ "pass@32": 0.3264747152113181,
759
+ "pass@64": 0.3549548508714773,
760
+ "pass@128": 0.38439306358381503
761
+ },
762
+ "18": {
763
+ "pass@1": 0.1616517857142857,
764
+ "pass@2": 0.19595648200224966,
765
+ "pass@4": 0.23106420322459695,
766
+ "pass@8": 0.26888622832851616,
767
+ "pass@16": 0.3071732341417159,
768
+ "pass@32": 0.3420263056744111,
769
+ "pass@64": 0.3714714335108666,
770
+ "pass@128": 0.39714285714285713
771
+ },
772
+ "19": {
773
+ "pass@1": 0.15040822072072071,
774
+ "pass@2": 0.1915057281691139,
775
+ "pass@4": 0.2273034600967671,
776
+ "pass@8": 0.2598221642368353,
777
+ "pass@16": 0.2905735616301428,
778
+ "pass@32": 0.32042698971733125,
779
+ "pass@64": 0.35124146512270843,
780
+ "pass@128": 0.3843843843843844
781
+ },
782
+ "20": {
783
+ "pass@1": 0.15052552552552553,
784
+ "pass@2": 0.18885223806483653,
785
+ "pass@4": 0.22942468621602477,
786
+ "pass@8": 0.26979999128432536,
787
+ "pass@16": 0.30689533475643643,
788
+ "pass@32": 0.3397581402934893,
789
+ "pass@64": 0.370991289408613,
790
+ "pass@128": 0.4024024024024024
791
+ }
792
+ }
793
+ },
794
+ "teachers_in_school": {
795
+ "count": 430464,
796
+ "correct": 107386,
797
+ "answer_accuracy": 0.24946569283377937,
798
+ "avg_response_len": 281.4464322219744,
799
+ "resp_tokens_sum": 121152557,
800
+ "pass_at_k": {
801
+ "pass@1": 0.24946569283377937,
802
+ "pass@2": 0.30024583617224077,
803
+ "pass@4": 0.34601646702614464,
804
+ "pass@8": 0.3878270818735659,
805
+ "pass@16": 0.42574861408875925,
806
+ "pass@32": 0.4604748763552623,
807
+ "pass@64": 0.492562625136877,
808
+ "pass@128": 0.5212607790663099
809
+ },
810
+ "per_op_pass_at_k": {
811
+ "11": {
812
+ "pass@1": 0.7309864457831325,
813
+ "pass@2": 0.811214839673655,
814
+ "pass@4": 0.8645121111555325,
815
+ "pass@8": 0.9001346785530842,
816
+ "pass@16": 0.9235781504819824,
817
+ "pass@32": 0.939017807747773,
818
+ "pass@64": 0.9485292552001391,
819
+ "pass@128": 0.9548192771084337
820
+ },
821
+ "12": {
822
+ "pass@1": 0.439042907523511,
823
+ "pass@2": 0.5317378835188704,
824
+ "pass@4": 0.6075961422729685,
825
+ "pass@8": 0.6683062360710545,
826
+ "pass@16": 0.7177272091161964,
827
+ "pass@32": 0.7598458853246168,
828
+ "pass@64": 0.7942021269976,
829
+ "pass@128": 0.8213166144200627
830
+ },
831
+ "13": {
832
+ "pass@1": 0.19227065826330533,
833
+ "pass@2": 0.23474754074858276,
834
+ "pass@4": 0.27791213361985206,
835
+ "pass@8": 0.32123267772057174,
836
+ "pass@16": 0.36547236780648024,
837
+ "pass@32": 0.412591392141311,
838
+ "pass@64": 0.46083134462740805,
839
+ "pass@128": 0.5042016806722689
840
+ },
841
+ "14": {
842
+ "pass@1": 0.20837902046783627,
843
+ "pass@2": 0.2522699670764838,
844
+ "pass@4": 0.2964683594923442,
845
+ "pass@8": 0.3416618793696011,
846
+ "pass@16": 0.3841336779597966,
847
+ "pass@32": 0.42375806001618826,
848
+ "pass@64": 0.4632207639904677,
849
+ "pass@128": 0.5029239766081871
850
+ },
851
+ "15": {
852
+ "pass@1": 0.19093276515151514,
853
+ "pass@2": 0.24199400501073715,
854
+ "pass@4": 0.2891956014020975,
855
+ "pass@8": 0.3326345282383242,
856
+ "pass@16": 0.37366558323041027,
857
+ "pass@32": 0.4128573661812689,
858
+ "pass@64": 0.4483166035224402,
859
+ "pass@128": 0.4727272727272727
860
+ },
861
+ "16": {
862
+ "pass@1": 0.15642806267806267,
863
+ "pass@2": 0.1904278691926329,
864
+ "pass@4": 0.22132350069489173,
865
+ "pass@8": 0.251268169779741,
866
+ "pass@16": 0.28180434335829546,
867
+ "pass@32": 0.3122157586188942,
868
+ "pass@64": 0.34211985927431315,
869
+ "pass@128": 0.3732193732193732
870
+ },
871
+ "17": {
872
+ "pass@1": 0.16779891304347827,
873
+ "pass@2": 0.21377104526336374,
874
+ "pass@4": 0.2616579632708955,
875
+ "pass@8": 0.3084358591590647,
876
+ "pass@16": 0.3485515761094292,
877
+ "pass@32": 0.38192734451252297,
878
+ "pass@64": 0.4116145438455866,
879
+ "pass@128": 0.43788819875776397
880
+ },
881
+ "18": {
882
+ "pass@1": 0.1518612132352941,
883
+ "pass@2": 0.1958187384205651,
884
+ "pass@4": 0.23828893419572547,
885
+ "pass@8": 0.27971765090438966,
886
+ "pass@16": 0.31909231139592203,
887
+ "pass@32": 0.3541919542192856,
888
+ "pass@64": 0.3879892602530647,
889
+ "pass@128": 0.4235294117647059
890
+ },
891
+ "19": {
892
+ "pass@1": 0.14004371279761904,
893
+ "pass@2": 0.17627703763592056,
894
+ "pass@4": 0.21033196631671033,
895
+ "pass@8": 0.24241868591666452,
896
+ "pass@16": 0.2725452956766527,
897
+ "pass@32": 0.3023936104609465,
898
+ "pass@64": 0.331830773125219,
899
+ "pass@128": 0.3601190476190476
900
+ },
901
+ "20": {
902
+ "pass@1": 0.1367421407185629,
903
+ "pass@2": 0.17771825934744678,
904
+ "pass@4": 0.21898993408009634,
905
+ "pass@8": 0.260066157867715,
906
+ "pass@16": 0.2989195527821917,
907
+ "pass@32": 0.3334463816599912,
908
+ "pass@64": 0.3631967204725436,
909
+ "pass@128": 0.38622754491017963
910
+ }
911
+ }
912
+ },
913
+ "movie_festival_awards": {
914
+ "count": 418816,
915
+ "correct": 102372,
916
+ "answer_accuracy": 0.24443192237163813,
917
+ "avg_response_len": 299.6827079194682,
918
+ "resp_tokens_sum": 125511913,
919
+ "pass_at_k": {
920
+ "pass@1": 0.24443192237163813,
921
+ "pass@2": 0.2936925752748208,
922
+ "pass@4": 0.3377196360158527,
923
+ "pass@8": 0.3778240125631929,
924
+ "pass@16": 0.414985362239759,
925
+ "pass@32": 0.44957728985444445,
926
+ "pass@64": 0.48298823095266347,
927
+ "pass@128": 0.51619804400978
928
+ },
929
+ "per_op_pass_at_k": {
930
+ "11": {
931
+ "pass@1": 0.7650545634920635,
932
+ "pass@2": 0.8431473409573809,
933
+ "pass@4": 0.8877359086066622,
934
+ "pass@8": 0.9145899228176874,
935
+ "pass@16": 0.9334761048558039,
936
+ "pass@32": 0.9489928847114996,
937
+ "pass@64": 0.9642010673849086,
938
+ "pass@128": 0.9809523809523809
939
+ },
940
+ "12": {
941
+ "pass@1": 0.4054615825688073,
942
+ "pass@2": 0.49957183534397626,
943
+ "pass@4": 0.5768995727598271,
944
+ "pass@8": 0.639355428961972,
945
+ "pass@16": 0.6924379381519293,
946
+ "pass@32": 0.7389465717283092,
947
+ "pass@64": 0.7833812207530858,
948
+ "pass@128": 0.8256880733944955
949
+ },
950
+ "13": {
951
+ "pass@1": 0.21805073302469136,
952
+ "pass@2": 0.2614506628511712,
953
+ "pass@4": 0.30509700407819385,
954
+ "pass@8": 0.34843945585161107,
955
+ "pass@16": 0.3892841325907048,
956
+ "pass@32": 0.42707158587605587,
957
+ "pass@64": 0.46350207303509766,
958
+ "pass@128": 0.5
959
+ },
960
+ "14": {
961
+ "pass@1": 0.19482566765578635,
962
+ "pass@2": 0.24658432440010283,
963
+ "pass@4": 0.2977982871762691,
964
+ "pass@8": 0.34710213397424194,
965
+ "pass@16": 0.39305571585018145,
966
+ "pass@32": 0.4354334888531387,
967
+ "pass@64": 0.4760201367236571,
968
+ "pass@128": 0.516320474777448
969
+ },
970
+ "15": {
971
+ "pass@1": 0.15263310185185186,
972
+ "pass@2": 0.20657633724603863,
973
+ "pass@4": 0.26040626837154607,
974
+ "pass@8": 0.31066200412971595,
975
+ "pass@16": 0.3575054534309268,
976
+ "pass@32": 0.40181191645690056,
977
+ "pass@64": 0.4460592807353767,
978
+ "pass@128": 0.49074074074074076
979
+ },
980
+ "16": {
981
+ "pass@1": 0.15613477138643067,
982
+ "pass@2": 0.1863313822497852,
983
+ "pass@4": 0.21273005393131156,
984
+ "pass@8": 0.23865739981671225,
985
+ "pass@16": 0.2676959517806339,
986
+ "pass@32": 0.3011310763692745,
987
+ "pass@64": 0.3362407178410369,
988
+ "pass@128": 0.37168141592920356
989
+ },
990
+ "17": {
991
+ "pass@1": 0.1378012048192771,
992
+ "pass@2": 0.17401619272365054,
993
+ "pass@4": 0.21124320340981484,
994
+ "pass@8": 0.24873698272180494,
995
+ "pass@16": 0.28468715470923933,
996
+ "pass@32": 0.31684364503562207,
997
+ "pass@64": 0.3460631347877581,
998
+ "pass@128": 0.37650602409638556
999
+ },
1000
+ "18": {
1001
+ "pass@1": 0.16940524193548387,
1002
+ "pass@2": 0.20793830962661927,
1003
+ "pass@4": 0.24442847708552565,
1004
+ "pass@8": 0.27696590133200394,
1005
+ "pass@16": 0.3047539165997947,
1006
+ "pass@32": 0.329140512075412,
1007
+ "pass@64": 0.3525837977178592,
1008
+ "pass@128": 0.3774193548387097
1009
+ },
1010
+ "19": {
1011
+ "pass@1": 0.13812311178247735,
1012
+ "pass@2": 0.16759305790137258,
1013
+ "pass@4": 0.19758813267676884,
1014
+ "pass@8": 0.2284911534750015,
1015
+ "pass@16": 0.2603002938970115,
1016
+ "pass@32": 0.29249731296624343,
1017
+ "pass@64": 0.3260332407783438,
1018
+ "pass@128": 0.36253776435045315
1019
+ },
1020
+ "20": {
1021
+ "pass@1": 0.13072447447447447,
1022
+ "pass@2": 0.16936079780567978,
1023
+ "pass@4": 0.2094049935762534,
1024
+ "pass@8": 0.25098284669882537,
1025
+ "pass@16": 0.291417867530838,
1026
+ "pass@32": 0.32739810777438544,
1027
+ "pass@64": 0.35806392636422313,
1028
+ "pass@128": 0.3813813813813814
1029
+ }
1030
+ }
1031
+ }
1032
+ }
1033
+ },
1034
+ "total": {
1035
+ "answer_accuracy": 0.588914884868421,
1036
+ "avg_loss": 6.166780625293129,
1037
+ "count": 2432000,
1038
+ "validation_metrics": {
1039
+ "id": {},
1040
+ "ood": {}
1041
+ },
1042
+ "pass_at_k": {
1043
+ "pass@1": 0.588914884868421,
1044
+ "pass@2": 0.6227466328221948,
1045
+ "pass@4": 0.6499032629408014,
1046
+ "pass@8": 0.6734744359367607,
1047
+ "pass@16": 0.6946532134389145,
1048
+ "pass@32": 0.7139263397862728,
1049
+ "pass@64": 0.7317913186635674,
1050
+ "pass@128": 0.748578947368421
1051
+ },
1052
+ "per_op_pass_at_k": {
1053
+ "10": {
1054
+ "pass@1": 0.89534375,
1055
+ "pass@2": 0.9316516978346466,
1056
+ "pass@4": 0.9502444778777662,
1057
+ "pass@8": 0.9628981587488723,
1058
+ "pass@16": 0.9732652153333301,
1059
+ "pass@32": 0.9817272204905513,
1060
+ "pass@64": 0.9875005300252477,
1061
+ "pass@128": 0.991
1062
+ },
1063
+ "2": {
1064
+ "pass@1": 0.9994453125,
1065
+ "pass@2": 0.9998784448818898,
1066
+ "pass@4": 0.999987270247469,
1067
+ "pass@8": 0.9999998760353234,
1068
+ "pass@16": 0.9999999999955357,
1069
+ "pass@32": 1.0,
1070
+ "pass@64": 1.0,
1071
+ "pass@128": 1.0
1072
+ },
1073
+ "3": {
1074
+ "pass@1": 0.9896953125,
1075
+ "pass@2": 0.9956350885826775,
1076
+ "pass@4": 0.998029471128609,
1077
+ "pass@8": 0.9990296020970205,
1078
+ "pass@16": 0.9994722941030695,
1079
+ "pass@32": 0.999768854949753,
1080
+ "pass@64": 0.9999711811022791,
1081
+ "pass@128": 1.0
1082
+ },
1083
+ "4": {
1084
+ "pass@1": 0.9968125,
1085
+ "pass@2": 0.9986126968503938,
1086
+ "pass@4": 0.9994299414135731,
1087
+ "pass@8": 0.9998706652443888,
1088
+ "pass@16": 0.999993190320664,
1089
+ "pass@32": 0.9999999900668188,
1090
+ "pass@64": 0.9999999999999999,
1091
+ "pass@128": 1.0
1092
+ },
1093
+ "5": {
1094
+ "pass@1": 0.95453125,
1095
+ "pass@2": 0.9673741387795279,
1096
+ "pass@4": 0.9768552553430826,
1097
+ "pass@8": 0.9845504029004547,
1098
+ "pass@16": 0.9905860749032567,
1099
+ "pass@32": 0.9947039335217139,
1100
+ "pass@64": 0.9972231304095797,
1101
+ "pass@128": 0.999
1102
+ },
1103
+ "6": {
1104
+ "pass@1": 0.9755546875,
1105
+ "pass@2": 0.9871010088582685,
1106
+ "pass@4": 0.9931382349081366,
1107
+ "pass@8": 0.9961125829051054,
1108
+ "pass@16": 0.9972334916240602,
1109
+ "pass@32": 0.9976868579347262,
1110
+ "pass@64": 0.998251968426333,
1111
+ "pass@128": 0.999
1112
+ },
1113
+ "7": {
1114
+ "pass@1": 0.9665546875,
1115
+ "pass@2": 0.980099409448819,
1116
+ "pass@4": 0.9868899991563556,
1117
+ "pass@8": 0.9912036930431989,
1118
+ "pass@16": 0.9940644245074017,
1119
+ "pass@32": 0.9960585379187011,
1120
+ "pass@64": 0.9974804599031311,
1121
+ "pass@128": 0.998
1122
+ },
1123
+ "8": {
1124
+ "pass@1": 0.95003125,
1125
+ "pass@2": 0.9737233021653545,
1126
+ "pass@4": 0.984498448537683,
1127
+ "pass@8": 0.9902565507718185,
1128
+ "pass@16": 0.9939245919129133,
1129
+ "pass@32": 0.996586911336722,
1130
+ "pass@64": 0.998538493317924,
1131
+ "pass@128": 1.0
1132
+ },
1133
+ "9": {
1134
+ "pass@1": 0.93934375,
1135
+ "pass@2": 0.9719249507874034,
1136
+ "pass@4": 0.98470880455568,
1137
+ "pass@8": 0.990996795989665,
1138
+ "pass@16": 0.9945284354115372,
1139
+ "pass@32": 0.9963217850684021,
1140
+ "pass@64": 0.9974302167766392,
1141
+ "pass@128": 0.998
1142
+ },
1143
+ "11": {
1144
+ "pass@1": 0.7491953125,
1145
+ "pass@2": 0.8267459399606308,
1146
+ "pass@4": 0.8755285344956877,
1147
+ "pass@8": 0.907201192042703,
1148
+ "pass@16": 0.9291708873567505,
1149
+ "pass@32": 0.9454073394478061,
1150
+ "pass@64": 0.9574549244959281,
1151
+ "pass@128": 0.967
1152
+ },
1153
+ "12": {
1154
+ "pass@1": 0.42375,
1155
+ "pass@2": 0.5180041830708659,
1156
+ "pass@4": 0.593009113235845,
1157
+ "pass@8": 0.6523426553791513,
1158
+ "pass@16": 0.702357863892847,
1159
+ "pass@32": 0.745694824334505,
1160
+ "pass@64": 0.7831121214310477,
1161
+ "pass@128": 0.816
1162
+ },
1163
+ "13": {
1164
+ "pass@1": 0.2113984375,
1165
+ "pass@2": 0.25608120078740143,
1166
+ "pass@4": 0.3006886397637793,
1167
+ "pass@8": 0.3447306794463938,
1168
+ "pass@16": 0.38799260684503434,
1169
+ "pass@32": 0.43170338099033495,
1170
+ "pass@64": 0.4761056180149899,
1171
+ "pass@128": 0.519
1172
+ },
1173
+ "14": {
1174
+ "pass@1": 0.200171875,
1175
+ "pass@2": 0.2479451279527559,
1176
+ "pass@4": 0.29500759055118103,
1177
+ "pass@8": 0.34113887630208256,
1178
+ "pass@16": 0.38481813556379674,
1179
+ "pass@32": 0.4260044402977936,
1180
+ "pass@64": 0.4671482956865662,
1181
+ "pass@128": 0.51
1182
+ },
1183
+ "15": {
1184
+ "pass@1": 0.17846875,
1185
+ "pass@2": 0.23154035433070885,
1186
+ "pass@4": 0.28250203355830483,
1187
+ "pass@8": 0.3294688278958133,
1188
+ "pass@16": 0.3735511399431241,
1189
+ "pass@32": 0.4152404384047447,
1190
+ "pass@64": 0.4535175002619446,
1191
+ "pass@128": 0.485
1192
+ },
1193
+ "16": {
1194
+ "pass@1": 0.158390625,
1195
+ "pass@2": 0.1908997293307087,
1196
+ "pass@4": 0.22120489895013098,
1197
+ "pass@8": 0.2515841984680904,
1198
+ "pass@16": 0.2832809698127581,
1199
+ "pass@32": 0.3157377911008275,
1200
+ "pass@64": 0.3483624231356763,
1201
+ "pass@128": 0.383
1202
+ },
1203
+ "17": {
1204
+ "pass@1": 0.15778125,
1205
+ "pass@2": 0.19822810039370062,
1206
+ "pass@4": 0.23773755933633278,
1207
+ "pass@8": 0.2756691021575299,
1208
+ "pass@16": 0.31012296220189073,
1209
+ "pass@32": 0.34113294654797527,
1210
+ "pass@64": 0.370247222269346,
1211
+ "pass@128": 0.399
1212
+ },
1213
+ "18": {
1214
+ "pass@1": 0.1607265625,
1215
+ "pass@2": 0.1996240157480314,
1216
+ "pass@4": 0.23766353665166848,
1217
+ "pass@8": 0.27507361063539393,
1218
+ "pass@16": 0.31047573197015055,
1219
+ "pass@32": 0.342168030163979,
1220
+ "pass@64": 0.37123232750738133,
1221
+ "pass@128": 0.4
1222
+ },
1223
+ "19": {
1224
+ "pass@1": 0.142859375,
1225
+ "pass@2": 0.17847379429133842,
1226
+ "pass@4": 0.21176526481064853,
1227
+ "pass@8": 0.24360403095909056,
1228
+ "pass@16": 0.27449561265010364,
1229
+ "pass@32": 0.305123051282576,
1230
+ "pass@64": 0.3363755503535671,
1231
+ "pass@128": 0.369
1232
+ },
1233
+ "20": {
1234
+ "pass@1": 0.139328125,
1235
+ "pass@2": 0.17864283956692892,
1236
+ "pass@4": 0.2192729213535806,
1237
+ "pass@8": 0.26028278177620573,
1238
+ "pass@16": 0.2990774269909148,
1239
+ "pass@32": 0.33353412208103916,
1240
+ "pass@64": 0.3640830914901839,
1241
+ "pass@128": 0.39
1242
+ }
1243
+ },
1244
+ "per_op_accuracy": {
1245
+ "10": 0.89534375,
1246
+ "2": 0.9994453125,
1247
+ "3": 0.9896953125,
1248
+ "4": 0.9968125,
1249
+ "5": 0.95453125,
1250
+ "6": 0.9755546875,
1251
+ "7": 0.9665546875,
1252
+ "8": 0.95003125,
1253
+ "9": 0.93934375,
1254
+ "11": 0.7491953125,
1255
+ "12": 0.42375,
1256
+ "13": 0.2113984375,
1257
+ "14": 0.200171875,
1258
+ "15": 0.17846875,
1259
+ "16": 0.158390625,
1260
+ "17": 0.15778125,
1261
+ "18": 0.1607265625,
1262
+ "19": 0.142859375,
1263
+ "20": 0.139328125
1264
+ },
1265
+ "per_op_avg_response_len": {
1266
+ "10": 279.685734375,
1267
+ "2": 116.979234375,
1268
+ "3": 147.114546875,
1269
+ "4": 160.9594765625,
1270
+ "5": 176.0514453125,
1271
+ "6": 200.8557890625,
1272
+ "7": 225.0676796875,
1273
+ "8": 244.2956640625,
1274
+ "9": 259.9049921875,
1275
+ "11": 291.4343515625,
1276
+ "12": 288.3101484375,
1277
+ "13": 282.0749140625,
1278
+ "14": 284.6130234375,
1279
+ "15": 282.3583984375,
1280
+ "16": 284.9855,
1281
+ "17": 285.6167265625,
1282
+ "18": 284.2999609375,
1283
+ "19": 281.3561640625,
1284
+ "20": 283.4558515625
1285
+ },
1286
+ "per_op_avg_loss": {
1287
+ "10": 0.13990753173828124,
1288
+ "2": 0.15822406005859374,
1289
+ "3": 0.15319696044921874,
1290
+ "4": 0.1460938720703125,
1291
+ "5": 0.1487813720703125,
1292
+ "6": 0.14940606689453126,
1293
+ "7": 0.14932769775390625,
1294
+ "8": 0.14807330322265624,
1295
+ "9": 0.1509442138671875,
1296
+ "11": 0.1496517333984375,
1297
+ "12": 0.160320068359375,
1298
+ "13": 0.18829949951171876,
1299
+ "14": 0.2179671630859375,
1300
+ "15": 0.24814111328125,
1301
+ "16": 0.280107177734375,
1302
+ "17": 0.3107010498046875,
1303
+ "18": 0.3399879150390625,
1304
+ "19": 0.40024169921875,
1305
+ "20": 0.419079345703125
1306
+ },
1307
+ "per_op_length": {
1308
+ "10": 279.685734375,
1309
+ "2": 116.979234375,
1310
+ "3": 147.114546875,
1311
+ "4": 160.9594765625,
1312
+ "5": 176.0514453125,
1313
+ "6": 200.8557890625,
1314
+ "7": 225.0676796875,
1315
+ "8": 244.2956640625,
1316
+ "9": 259.9049921875,
1317
+ "11": 291.4343515625,
1318
+ "12": 288.3101484375,
1319
+ "13": 282.0749140625,
1320
+ "14": 284.6130234375,
1321
+ "15": 282.3583984375,
1322
+ "16": 284.9855,
1323
+ "17": 285.6167265625,
1324
+ "18": 284.2999609375,
1325
+ "19": 281.3561640625,
1326
+ "20": 283.4558515625
1327
+ },
1328
+ "per_op_loss": {
1329
+ "10": 0.13990753173828124,
1330
+ "2": 0.15822406005859374,
1331
+ "3": 0.15319696044921874,
1332
+ "4": 0.1460938720703125,
1333
+ "5": 0.1487813720703125,
1334
+ "6": 0.14940606689453126,
1335
+ "7": 0.14932769775390625,
1336
+ "8": 0.14807330322265624,
1337
+ "9": 0.1509442138671875,
1338
+ "11": 0.1496517333984375,
1339
+ "12": 0.160320068359375,
1340
+ "13": 0.18829949951171876,
1341
+ "14": 0.2179671630859375,
1342
+ "15": 0.24814111328125,
1343
+ "16": 0.280107177734375,
1344
+ "17": 0.3107010498046875,
1345
+ "18": 0.3399879150390625,
1346
+ "19": 0.40024169921875,
1347
+ "20": 0.419079345703125
1348
+ },
1349
+ "per_template": {
1350
+ "crazy_zootopia": {
1351
+ "count": 809728,
1352
+ "correct": 477128,
1353
+ "answer_accuracy": 0.5892447834334492,
1354
+ "avg_response_len": 235.475637498024,
1355
+ "resp_tokens_sum": 190671217,
1356
+ "pass_at_k": {
1357
+ "pass@1": 0.5892447834334492,
1358
+ "pass@2": 0.6242494417489666,
1359
+ "pass@4": 0.6517067209785661,
1360
+ "pass@8": 0.6752650274111465,
1361
+ "pass@16": 0.6966098480320178,
1362
+ "pass@32": 0.7161446042348242,
1363
+ "pass@64": 0.7340453705919688,
1364
+ "pass@128": 0.7510275055327221
1365
+ },
1366
+ "per_op_pass_at_k": {
1367
+ "10": {
1368
+ "pass@1": 0.8994976032448377,
1369
+ "pass@2": 0.9366747090795067,
1370
+ "pass@4": 0.954640098361598,
1371
+ "pass@8": 0.9671935395338649,
1372
+ "pass@16": 0.9780409476989385,
1373
+ "pass@32": 0.9873672705205916,
1374
+ "pass@64": 0.9931124421272916,
1375
+ "pass@128": 0.9941002949852508
1376
+ },
1377
+ "2": {
1378
+ "pass@1": 0.9997322819314641,
1379
+ "pass@2": 0.9999915679348491,
1380
+ "pass@4": 0.9999999897793149,
1381
+ "pass@8": 1.0,
1382
+ "pass@16": 1.0,
1383
+ "pass@32": 1.0,
1384
+ "pass@64": 1.0,
1385
+ "pass@128": 1.0
1386
+ },
1387
+ "3": {
1388
+ "pass@1": 0.9875525611620795,
1389
+ "pass@2": 0.9937483445303281,
1390
+ "pass@4": 0.9960841566249173,
1391
+ "pass@8": 0.9973945614382809,
1392
+ "pass@16": 0.9984052318072489,
1393
+ "pass@32": 0.9992931754002767,
1394
+ "pass@64": 0.9999118688143109,
1395
+ "pass@128": 1.0
1396
+ },
1397
+ "4": {
1398
+ "pass@1": 0.996337890625,
1399
+ "pass@2": 0.9982602577509845,
1400
+ "pass@4": 0.9992694307742784,
1401
+ "pass@8": 0.9998455780548049,
1402
+ "pass@16": 0.9999940908757206,
1403
+ "pass@32": 0.9999999967265157,
1404
+ "pass@64": 1.0,
1405
+ "pass@128": 1.0
1406
+ },
1407
+ "5": {
1408
+ "pass@1": 0.9481150793650793,
1409
+ "pass@2": 0.9641533089613797,
1410
+ "pass@4": 0.9757381440415184,
1411
+ "pass@8": 0.9837977641761237,
1412
+ "pass@16": 0.9891567790393069,
1413
+ "pass@32": 0.9924331098404072,
1414
+ "pass@64": 0.9948057827917357,
1415
+ "pass@128": 0.9968253968253968
1416
+ },
1417
+ "6": {
1418
+ "pass@1": 0.9665746631736527,
1419
+ "pass@2": 0.9810815131547932,
1420
+ "pass@4": 0.9889092237347574,
1421
+ "pass@8": 0.992817054998713,
1422
+ "pass@16": 0.9942118482404367,
1423
+ "pass@32": 0.9947566055669427,
1424
+ "pass@64": 0.9955089818359227,
1425
+ "pass@128": 0.9970059880239521
1426
+ },
1427
+ "7": {
1428
+ "pass@1": 0.9575397559171598,
1429
+ "pass@2": 0.9706686361412663,
1430
+ "pass@4": 0.9781720887884576,
1431
+ "pass@8": 0.9836457175352227,
1432
+ "pass@16": 0.9877211930993773,
1433
+ "pass@32": 0.991030454111008,
1434
+ "pass@64": 0.9934560055493304,
1435
+ "pass@128": 0.9940828402366864
1436
+ },
1437
+ "8": {
1438
+ "pass@1": 0.9534755608974359,
1439
+ "pass@2": 0.9758759684786995,
1440
+ "pass@4": 0.9847553208493165,
1441
+ "pass@8": 0.9897007291187125,
1442
+ "pass@16": 0.9935694000922215,
1443
+ "pass@32": 0.9965151373391712,
1444
+ "pass@64": 0.998300039303945,
1445
+ "pass@128": 1.0
1446
+ },
1447
+ "9": {
1448
+ "pass@1": 0.9408450704225352,
1449
+ "pass@2": 0.9764160751913055,
1450
+ "pass@4": 0.9897490211610868,
1451
+ "pass@8": 0.9954928597779902,
1452
+ "pass@16": 0.9986146363237696,
1453
+ "pass@32": 0.9998472651017836,
1454
+ "pass@64": 0.9999990904114461,
1455
+ "pass@128": 1.0
1456
+ },
1457
+ "11": {
1458
+ "pass@1": 0.7521689093484419,
1459
+ "pass@2": 0.8267172826838576,
1460
+ "pass@4": 0.8749963240253604,
1461
+ "pass@8": 0.9072539747181511,
1462
+ "pass@16": 0.9305891455160197,
1463
+ "pass@32": 0.9482171914771189,
1464
+ "pass@64": 0.9598296757598728,
1465
+ "pass@128": 0.9660056657223796
1466
+ },
1467
+ "12": {
1468
+ "pass@1": 0.4268626412429379,
1469
+ "pass@2": 0.5226548250589441,
1470
+ "pass@4": 0.5947451512946471,
1471
+ "pass@8": 0.6499540700619184,
1472
+ "pass@16": 0.697671407935311,
1473
+ "pass@32": 0.7391764349175001,
1474
+ "pass@64": 0.7728700105439386,
1475
+ "pass@128": 0.8022598870056498
1476
+ },
1477
+ "13": {
1478
+ "pass@1": 0.226048197492163,
1479
+ "pass@2": 0.2745025516500875,
1480
+ "pass@4": 0.32170087379359735,
1481
+ "pass@8": 0.36726090847720233,
1482
+ "pass@16": 0.4118837384913241,
1483
+ "pass@32": 0.4577964268716136,
1484
+ "pass@64": 0.5060004900301981,
1485
+ "pass@128": 0.554858934169279
1486
+ },
1487
+ "14": {
1488
+ "pass@1": 0.1970404984423676,
1489
+ "pass@2": 0.2447659871955257,
1490
+ "pass@4": 0.2905214605183697,
1491
+ "pass@8": 0.33432116638118264,
1492
+ "pass@16": 0.37689919457954874,
1493
+ "pass@32": 0.41849874775311363,
1494
+ "pass@64": 0.46201871752633594,
1495
+ "pass@128": 0.5109034267912772
1496
+ },
1497
+ "15": {
1498
+ "pass@1": 0.190774024566474,
1499
+ "pass@2": 0.24494681910245306,
1500
+ "pass@4": 0.29680882700356104,
1501
+ "pass@8": 0.3440605324830605,
1502
+ "pass@16": 0.38846742937996726,
1503
+ "pass@32": 0.43008799604881653,
1504
+ "pass@64": 0.4654618905817254,
1505
+ "pass@128": 0.4913294797687861
1506
+ },
1507
+ "16": {
1508
+ "pass@1": 0.1630796370967742,
1509
+ "pass@2": 0.1964297053594107,
1510
+ "pass@4": 0.23033826426938572,
1511
+ "pass@8": 0.2660781043146322,
1512
+ "pass@16": 0.3019958633560044,
1513
+ "pass@32": 0.33569879011745574,
1514
+ "pass@64": 0.3686862876847773,
1515
+ "pass@128": 0.4064516129032258
1516
+ },
1517
+ "17": {
1518
+ "pass@1": 0.1676300578034682,
1519
+ "pass@2": 0.20699551397296437,
1520
+ "pass@4": 0.2408987041359715,
1521
+ "pass@8": 0.271017564290959,
1522
+ "pass@16": 0.29876652985892266,
1523
+ "pass@32": 0.3264747152113181,
1524
+ "pass@64": 0.3549548508714773,
1525
+ "pass@128": 0.38439306358381503
1526
+ },
1527
+ "18": {
1528
+ "pass@1": 0.1616517857142857,
1529
+ "pass@2": 0.19595648200224966,
1530
+ "pass@4": 0.23106420322459695,
1531
+ "pass@8": 0.26888622832851616,
1532
+ "pass@16": 0.3071732341417159,
1533
+ "pass@32": 0.3420263056744111,
1534
+ "pass@64": 0.3714714335108666,
1535
+ "pass@128": 0.39714285714285713
1536
+ },
1537
+ "19": {
1538
+ "pass@1": 0.15040822072072071,
1539
+ "pass@2": 0.1915057281691139,
1540
+ "pass@4": 0.2273034600967671,
1541
+ "pass@8": 0.2598221642368353,
1542
+ "pass@16": 0.2905735616301428,
1543
+ "pass@32": 0.32042698971733125,
1544
+ "pass@64": 0.35124146512270843,
1545
+ "pass@128": 0.3843843843843844
1546
+ },
1547
+ "20": {
1548
+ "pass@1": 0.15052552552552553,
1549
+ "pass@2": 0.18885223806483653,
1550
+ "pass@4": 0.22942468621602477,
1551
+ "pass@8": 0.26979999128432536,
1552
+ "pass@16": 0.30689533475643643,
1553
+ "pass@32": 0.3397581402934893,
1554
+ "pass@64": 0.370991289408613,
1555
+ "pass@128": 0.4024024024024024
1556
+ }
1557
+ }
1558
+ },
1559
+ "teachers_in_school": {
1560
+ "count": 821120,
1561
+ "correct": 483803,
1562
+ "answer_accuracy": 0.5891988990646921,
1563
+ "avg_response_len": 242.341766124318,
1564
+ "resp_tokens_sum": 198991671,
1565
+ "pass_at_k": {
1566
+ "pass@1": 0.5891988990646921,
1567
+ "pass@2": 0.6227068003142299,
1568
+ "pass@4": 0.650107249088997,
1569
+ "pass@8": 0.6741994105674554,
1570
+ "pass@16": 0.6956359816626693,
1571
+ "pass@32": 0.7149534992963424,
1572
+ "pass@64": 0.7325456550204769,
1573
+ "pass@128": 0.7480904130943102
1574
+ },
1575
+ "per_op_pass_at_k": {
1576
+ "10": {
1577
+ "pass@1": 0.8878930214723927,
1578
+ "pass@2": 0.9240016333751996,
1579
+ "pass@4": 0.9427751138086263,
1580
+ "pass@8": 0.9559837013558444,
1581
+ "pass@16": 0.9669589357176731,
1582
+ "pass@32": 0.9757400698048759,
1583
+ "pass@64": 0.9820426961144421,
1584
+ "pass@128": 0.9877300613496932
1585
+ },
1586
+ "2": {
1587
+ "pass@1": 0.9987177051671733,
1588
+ "pass@2": 0.9996395064978579,
1589
+ "pass@4": 0.9999613177152248,
1590
+ "pass@8": 0.9999996232076699,
1591
+ "pass@16": 0.9999999999864305,
1592
+ "pass@32": 1.0,
1593
+ "pass@64": 1.0,
1594
+ "pass@128": 1.0
1595
+ },
1596
+ "3": {
1597
+ "pass@1": 0.9907670454545454,
1598
+ "pass@2": 0.9960163892865667,
1599
+ "pass@4": 0.9985778993534897,
1600
+ "pass@8": 0.9996915957072215,
1601
+ "pass@16": 0.9999815253902985,
1602
+ "pass@32": 0.99999995938087,
1603
+ "pass@64": 0.9999999999999987,
1604
+ "pass@128": 1.0
1605
+ },
1606
+ "4": {
1607
+ "pass@1": 0.9982664571005917,
1608
+ "pass@2": 0.9998132687648512,
1609
+ "pass@4": 0.999995557138198,
1610
+ "pass@8": 0.9999999967656203,
1611
+ "pass@16": 0.9999999999999998,
1612
+ "pass@32": 1.0,
1613
+ "pass@64": 1.0,
1614
+ "pass@128": 1.0
1615
+ },
1616
+ "5": {
1617
+ "pass@1": 0.9600317028985508,
1618
+ "pass@2": 0.9691975493552438,
1619
+ "pass@4": 0.9763000869456536,
1620
+ "pass@8": 0.9828931121195829,
1621
+ "pass@16": 0.989036439654314,
1622
+ "pass@32": 0.9944838684919953,
1623
+ "pass@64": 0.9981653755234783,
1624
+ "pass@128": 1.0
1625
+ },
1626
+ "6": {
1627
+ "pass@1": 0.9823379297994269,
1628
+ "pass@2": 0.9910194058389551,
1629
+ "pass@4": 0.9946701218221643,
1630
+ "pass@8": 0.9965445299840145,
1631
+ "pass@16": 0.9976712739922713,
1632
+ "pass@32": 0.9983904225804993,
1633
+ "pass@64": 0.9992893080034889,
1634
+ "pass@128": 1.0
1635
+ },
1636
+ "7": {
1637
+ "pass@1": 0.9681855130057804,
1638
+ "pass@2": 0.9832421146056164,
1639
+ "pass@4": 0.9914642009163593,
1640
+ "pass@8": 0.9964758591421767,
1641
+ "pass@16": 0.9990374180123244,
1642
+ "pass@32": 0.9999126056145744,
1643
+ "pass@64": 0.9999997855847184,
1644
+ "pass@128": 1.0
1645
+ },
1646
+ "8": {
1647
+ "pass@1": 0.9455765845070423,
1648
+ "pass@2": 0.9712158284351782,
1649
+ "pass@4": 0.9835400283591312,
1650
+ "pass@8": 0.9910390348371692,
1651
+ "pass@16": 0.9955726997113149,
1652
+ "pass@32": 0.9980539040490879,
1653
+ "pass@64": 0.9995741614459047,
1654
+ "pass@128": 1.0
1655
+ },
1656
+ "9": {
1657
+ "pass@1": 0.9390437874251497,
1658
+ "pass@2": 0.9670264539818006,
1659
+ "pass@4": 0.9783135702722786,
1660
+ "pass@8": 0.9840906736702117,
1661
+ "pass@16": 0.9879928605789021,
1662
+ "pass@32": 0.9908513468886352,
1663
+ "pass@64": 0.9930496212972972,
1664
+ "pass@128": 0.9940119760479041
1665
+ },
1666
+ "11": {
1667
+ "pass@1": 0.7309864457831325,
1668
+ "pass@2": 0.811214839673655,
1669
+ "pass@4": 0.8645121111555325,
1670
+ "pass@8": 0.9001346785530842,
1671
+ "pass@16": 0.9235781504819824,
1672
+ "pass@32": 0.939017807747773,
1673
+ "pass@64": 0.9485292552001391,
1674
+ "pass@128": 0.9548192771084337
1675
+ },
1676
+ "12": {
1677
+ "pass@1": 0.439042907523511,
1678
+ "pass@2": 0.5317378835188704,
1679
+ "pass@4": 0.6075961422729685,
1680
+ "pass@8": 0.6683062360710545,
1681
+ "pass@16": 0.7177272091161964,
1682
+ "pass@32": 0.7598458853246168,
1683
+ "pass@64": 0.7942021269976,
1684
+ "pass@128": 0.8213166144200627
1685
+ },
1686
+ "13": {
1687
+ "pass@1": 0.19227065826330533,
1688
+ "pass@2": 0.23474754074858276,
1689
+ "pass@4": 0.27791213361985206,
1690
+ "pass@8": 0.32123267772057174,
1691
+ "pass@16": 0.36547236780648024,
1692
+ "pass@32": 0.412591392141311,
1693
+ "pass@64": 0.46083134462740805,
1694
+ "pass@128": 0.5042016806722689
1695
+ },
1696
+ "14": {
1697
+ "pass@1": 0.20837902046783627,
1698
+ "pass@2": 0.2522699670764838,
1699
+ "pass@4": 0.2964683594923442,
1700
+ "pass@8": 0.3416618793696011,
1701
+ "pass@16": 0.3841336779597966,
1702
+ "pass@32": 0.42375806001618826,
1703
+ "pass@64": 0.4632207639904677,
1704
+ "pass@128": 0.5029239766081871
1705
+ },
1706
+ "15": {
1707
+ "pass@1": 0.19093276515151514,
1708
+ "pass@2": 0.24199400501073715,
1709
+ "pass@4": 0.2891956014020975,
1710
+ "pass@8": 0.3326345282383242,
1711
+ "pass@16": 0.37366558323041027,
1712
+ "pass@32": 0.4128573661812689,
1713
+ "pass@64": 0.4483166035224402,
1714
+ "pass@128": 0.4727272727272727
1715
+ },
1716
+ "16": {
1717
+ "pass@1": 0.15642806267806267,
1718
+ "pass@2": 0.1904278691926329,
1719
+ "pass@4": 0.22132350069489173,
1720
+ "pass@8": 0.251268169779741,
1721
+ "pass@16": 0.28180434335829546,
1722
+ "pass@32": 0.3122157586188942,
1723
+ "pass@64": 0.34211985927431315,
1724
+ "pass@128": 0.3732193732193732
1725
+ },
1726
+ "17": {
1727
+ "pass@1": 0.16779891304347827,
1728
+ "pass@2": 0.21377104526336374,
1729
+ "pass@4": 0.2616579632708955,
1730
+ "pass@8": 0.3084358591590647,
1731
+ "pass@16": 0.3485515761094292,
1732
+ "pass@32": 0.38192734451252297,
1733
+ "pass@64": 0.4116145438455866,
1734
+ "pass@128": 0.43788819875776397
1735
+ },
1736
+ "18": {
1737
+ "pass@1": 0.1518612132352941,
1738
+ "pass@2": 0.1958187384205651,
1739
+ "pass@4": 0.23828893419572547,
1740
+ "pass@8": 0.27971765090438966,
1741
+ "pass@16": 0.31909231139592203,
1742
+ "pass@32": 0.3541919542192856,
1743
+ "pass@64": 0.3879892602530647,
1744
+ "pass@128": 0.4235294117647059
1745
+ },
1746
+ "19": {
1747
+ "pass@1": 0.14004371279761904,
1748
+ "pass@2": 0.17627703763592056,
1749
+ "pass@4": 0.21033196631671033,
1750
+ "pass@8": 0.24241868591666452,
1751
+ "pass@16": 0.2725452956766527,
1752
+ "pass@32": 0.3023936104609465,
1753
+ "pass@64": 0.331830773125219,
1754
+ "pass@128": 0.3601190476190476
1755
+ },
1756
+ "20": {
1757
+ "pass@1": 0.1367421407185629,
1758
+ "pass@2": 0.17771825934744678,
1759
+ "pass@4": 0.21898993408009634,
1760
+ "pass@8": 0.260066157867715,
1761
+ "pass@16": 0.2989195527821917,
1762
+ "pass@32": 0.3334463816599912,
1763
+ "pass@64": 0.3631967204725436,
1764
+ "pass@128": 0.38622754491017963
1765
+ }
1766
+ }
1767
+ },
1768
+ "movie_festival_awards": {
1769
+ "count": 801152,
1770
+ "correct": 471310,
1771
+ "answer_accuracy": 0.5882903618788944,
1772
+ "avg_response_len": 258.0569242790382,
1773
+ "resp_tokens_sum": 206742821,
1774
+ "pass_at_k": {
1775
+ "pass@1": 0.5882903618788944,
1776
+ "pass@2": 0.6212685622467475,
1777
+ "pass@4": 0.6478714293112718,
1778
+ "pass@8": 0.6709216329453926,
1779
+ "pass@16": 0.6916683710373397,
1780
+ "pass@32": 0.7106315691905148,
1781
+ "pass@64": 0.728740000525103,
1782
+ "pass@128": 0.7466048889598977
1783
+ },
1784
+ "per_op_pass_at_k": {
1785
+ "10": {
1786
+ "pass@1": 0.8983908582089553,
1787
+ "pass@2": 0.9340132506757551,
1788
+ "pass@4": 0.9530650669599133,
1789
+ "pass@8": 0.9652801856862316,
1790
+ "pass@16": 0.9745693164759084,
1791
+ "pass@32": 0.9818461284408381,
1792
+ "pass@64": 0.9871328155545905,
1793
+ "pass@128": 0.991044776119403
1794
+ },
1795
+ "2": {
1796
+ "pass@1": 0.9998660714285714,
1797
+ "pass@2": 0.9999992969628797,
1798
+ "pass@4": 1.0,
1799
+ "pass@8": 1.0,
1800
+ "pass@16": 1.0,
1801
+ "pass@32": 1.0,
1802
+ "pass@64": 1.0,
1803
+ "pass@128": 1.0
1804
+ },
1805
+ "3": {
1806
+ "pass@1": 0.9907069970845481,
1807
+ "pass@2": 0.9970669721769471,
1808
+ "pass@4": 0.9993563997831172,
1809
+ "pass@8": 0.9999514691641971,
1810
+ "pass@16": 0.9999996499221584,
1811
+ "pass@32": 0.9999999999946803,
1812
+ "pass@64": 1.0,
1813
+ "pass@128": 1.0
1814
+ },
1815
+ "4": {
1816
+ "pass@1": 0.9958196271929824,
1817
+ "pass@2": 0.9977559342911083,
1818
+ "pass@4": 0.9990211264710332,
1819
+ "pass@8": 0.9997663197662907,
1820
+ "pass@16": 0.9999856176620865,
1821
+ "pass@32": 0.9999999740185194,
1822
+ "pass@64": 0.9999999999999997,
1823
+ "pass@128": 1.0
1824
+ },
1825
+ "5": {
1826
+ "pass@1": 0.9548943014705882,
1827
+ "pass@2": 0.9685079174386292,
1828
+ "pass@4": 0.9784535587463331,
1829
+ "pass@8": 0.9869293632462349,
1830
+ "pass@16": 0.9934826994856965,
1831
+ "pass@32": 0.9970310861536675,
1832
+ "pass@64": 0.9985066302193607,
1833
+ "pass@128": 1.0
1834
+ },
1835
+ "6": {
1836
+ "pass@1": 0.9775483044164038,
1837
+ "pass@2": 0.9891293716932861,
1838
+ "pass@4": 0.9959075131381444,
1839
+ "pass@8": 0.9991092920224423,
1840
+ "pass@16": 0.9999352040645176,
1841
+ "pass@32": 0.9999996662295679,
1842
+ "pass@64": 0.9999999999997393,
1843
+ "pass@128": 1.0
1844
+ },
1845
+ "7": {
1846
+ "pass@1": 0.9744115901898734,
1847
+ "pass@2": 0.9867456923402768,
1848
+ "pass@4": 0.9912063912303684,
1849
+ "pass@8": 0.9935151685224703,
1850
+ "pass@16": 0.9954041601504686,
1851
+ "pass@32": 0.9972167179953726,
1852
+ "pass@64": 0.9990265956175474,
1853
+ "pass@128": 1.0
1854
+ },
1855
+ "8": {
1856
+ "pass@1": 0.9515531156156156,
1857
+ "pass@2": 0.9743795222387744,
1858
+ "pass@4": 0.9852795146102226,
1859
+ "pass@8": 0.9899431408996546,
1860
+ "pass@16": 0.9925003924523234,
1861
+ "pass@32": 0.9950902478963195,
1862
+ "pass@64": 0.9976578190444356,
1863
+ "pass@128": 1.0
1864
+ },
1865
+ "9": {
1866
+ "pass@1": 0.9379521704180064,
1867
+ "pass@2": 0.9720591912043948,
1868
+ "pass@4": 0.9858236963747689,
1869
+ "pass@8": 0.9932814976290268,
1870
+ "pass@16": 0.9968830356374474,
1871
+ "pass@32": 0.9981725276413653,
1872
+ "pass@64": 0.9992024635603814,
1873
+ "pass@128": 1.0
1874
+ },
1875
+ "11": {
1876
+ "pass@1": 0.7650545634920635,
1877
+ "pass@2": 0.8431473409573809,
1878
+ "pass@4": 0.8877359086066622,
1879
+ "pass@8": 0.9145899228176874,
1880
+ "pass@16": 0.9334761048558039,
1881
+ "pass@32": 0.9489928847114996,
1882
+ "pass@64": 0.9642010673849086,
1883
+ "pass@128": 0.9809523809523809
1884
+ },
1885
+ "12": {
1886
+ "pass@1": 0.4054615825688073,
1887
+ "pass@2": 0.49957183534397626,
1888
+ "pass@4": 0.5768995727598271,
1889
+ "pass@8": 0.639355428961972,
1890
+ "pass@16": 0.6924379381519293,
1891
+ "pass@32": 0.7389465717283092,
1892
+ "pass@64": 0.7833812207530858,
1893
+ "pass@128": 0.8256880733944955
1894
+ },
1895
+ "13": {
1896
+ "pass@1": 0.21805073302469136,
1897
+ "pass@2": 0.2614506628511712,
1898
+ "pass@4": 0.30509700407819385,
1899
+ "pass@8": 0.34843945585161107,
1900
+ "pass@16": 0.3892841325907048,
1901
+ "pass@32": 0.42707158587605587,
1902
+ "pass@64": 0.46350207303509766,
1903
+ "pass@128": 0.5
1904
+ },
1905
+ "14": {
1906
+ "pass@1": 0.19482566765578635,
1907
+ "pass@2": 0.24658432440010283,
1908
+ "pass@4": 0.2977982871762691,
1909
+ "pass@8": 0.34710213397424194,
1910
+ "pass@16": 0.39305571585018145,
1911
+ "pass@32": 0.4354334888531387,
1912
+ "pass@64": 0.4760201367236571,
1913
+ "pass@128": 0.516320474777448
1914
+ },
1915
+ "15": {
1916
+ "pass@1": 0.15263310185185186,
1917
+ "pass@2": 0.20657633724603863,
1918
+ "pass@4": 0.26040626837154607,
1919
+ "pass@8": 0.31066200412971595,
1920
+ "pass@16": 0.3575054534309268,
1921
+ "pass@32": 0.40181191645690056,
1922
+ "pass@64": 0.4460592807353767,
1923
+ "pass@128": 0.49074074074074076
1924
+ },
1925
+ "16": {
1926
+ "pass@1": 0.15613477138643067,
1927
+ "pass@2": 0.1863313822497852,
1928
+ "pass@4": 0.21273005393131156,
1929
+ "pass@8": 0.23865739981671225,
1930
+ "pass@16": 0.2676959517806339,
1931
+ "pass@32": 0.3011310763692745,
1932
+ "pass@64": 0.3362407178410369,
1933
+ "pass@128": 0.37168141592920356
1934
+ },
1935
+ "17": {
1936
+ "pass@1": 0.1378012048192771,
1937
+ "pass@2": 0.17401619272365054,
1938
+ "pass@4": 0.21124320340981484,
1939
+ "pass@8": 0.24873698272180494,
1940
+ "pass@16": 0.28468715470923933,
1941
+ "pass@32": 0.31684364503562207,
1942
+ "pass@64": 0.3460631347877581,
1943
+ "pass@128": 0.37650602409638556
1944
+ },
1945
+ "18": {
1946
+ "pass@1": 0.16940524193548387,
1947
+ "pass@2": 0.20793830962661927,
1948
+ "pass@4": 0.24442847708552565,
1949
+ "pass@8": 0.27696590133200394,
1950
+ "pass@16": 0.3047539165997947,
1951
+ "pass@32": 0.329140512075412,
1952
+ "pass@64": 0.3525837977178592,
1953
+ "pass@128": 0.3774193548387097
1954
+ },
1955
+ "19": {
1956
+ "pass@1": 0.13812311178247735,
1957
+ "pass@2": 0.16759305790137258,
1958
+ "pass@4": 0.19758813267676884,
1959
+ "pass@8": 0.2284911534750015,
1960
+ "pass@16": 0.2603002938970115,
1961
+ "pass@32": 0.29249731296624343,
1962
+ "pass@64": 0.3260332407783438,
1963
+ "pass@128": 0.36253776435045315
1964
+ },
1965
+ "20": {
1966
+ "pass@1": 0.13072447447447447,
1967
+ "pass@2": 0.16936079780567978,
1968
+ "pass@4": 0.2094049935762534,
1969
+ "pass@8": 0.25098284669882537,
1970
+ "pass@16": 0.291417867530838,
1971
+ "pass@32": 0.32739810777438544,
1972
+ "pass@64": 0.35806392636422313,
1973
+ "pass@128": 0.3813813813813814
1974
+ }
1975
+ }
1976
+ }
1977
+ }
1978
+ }
1979
+ }