dataset,prompt,metric,value
anli_dev_r1,GPT-3 style,accuracy,0.497
anli_dev_r1,MNLI crowdsource,accuracy,0.442
anli_dev_r1,can we infer,accuracy,0.456
anli_dev_r1,guaranteed/possible/impossible,accuracy,0.328
anli_dev_r1,justified in saying,accuracy,0.46
anli_dev_r1,median,accuracy,0.456
anli_dev_r2,GPT-3 style,accuracy,0.45
anli_dev_r2,MNLI crowdsource,accuracy,0.382
anli_dev_r2,can we infer,accuracy,0.419
anli_dev_r2,guaranteed/possible/impossible,accuracy,0.345
anli_dev_r2,justified in saying,accuracy,0.41
anli_dev_r2,median,accuracy,0.41
anli_dev_r3,GPT-3 style,accuracy,0.4558333333333333
anli_dev_r3,MNLI crowdsource,accuracy,0.41333333333333333
anli_dev_r3,can we infer,accuracy,0.4225
anli_dev_r3,guaranteed/possible/impossible,accuracy,0.305
anli_dev_r3,justified in saying,accuracy,0.4083333333333333
anli_dev_r3,median,accuracy,0.41333333333333333
story_cloze_2016,Answer Given options,accuracy,0.9524318546231961
story_cloze_2016,Choose Story Ending,accuracy,0.9668626402993051
story_cloze_2016,Generate Ending,accuracy,0.7760555852485302
story_cloze_2016,Novel Correct Ending,accuracy,0.9583110636023516
story_cloze_2016,Story Continuation and Options,accuracy,0.9593800106894709
story_cloze_2016,median,accuracy,0.9583110636023516
super_glue_cb,GPT-3 style,accuracy,0.875
super_glue_cb,MNLI crowdsource,accuracy,0.35714285714285715
super_glue_cb,can we infer,accuracy,0.75
super_glue_cb,guaranteed/possible/impossible,accuracy,0.7678571428571429
super_glue_cb,justified in saying,accuracy,0.8035714285714286
super_glue_cb,median,accuracy,0.7678571428571429
super_glue_copa,"C1 or C2? premise, so/because…",accuracy,0.75
super_glue_copa,best_option,accuracy,0.87
super_glue_copa,cause_effect,accuracy,0.9
super_glue_copa,i_am_hesitating,accuracy,0.91
super_glue_copa,plausible_alternatives,accuracy,0.91
super_glue_copa,median,accuracy,0.9
super_glue_rte,GPT-3 style,accuracy,0.7870036101083032
super_glue_rte,MNLI crowdsource,accuracy,0.8592057761732852
super_glue_rte,does it follow that,accuracy,0.8194945848375451
super_glue_rte,guaranteed true,accuracy,0.7942238267148014
super_glue_rte,should assume,accuracy,0.8122743682310469
super_glue_rte,median,accuracy,0.8122743682310469
winogrande_winogrande_xl,Replace,accuracy,0.5998421468034728
winogrande_winogrande_xl,True or False,accuracy,0.5359116022099447
winogrande_winogrande_xl,does underscore refer to,accuracy,0.5864246250986582
winogrande_winogrande_xl,stand for,accuracy,0.5201262825572218
winogrande_winogrande_xl,underscore refer to,accuracy,0.5880031570639306
winogrande_winogrande_xl,median,accuracy,0.5864246250986582
xcopa_id,"C1 or C2? premise, so/because…",accuracy,0.56
xcopa_id,best_option,accuracy,0.81
xcopa_id,cause_effect,accuracy,0.87
xcopa_id,i_am_hesitating,accuracy,0.83
xcopa_id,plausible_alternatives,accuracy,0.87
xcopa_id,median,accuracy,0.83
xcopa_sw,"C1 or C2? premise, so/because…",accuracy,0.6
xcopa_sw,best_option,accuracy,0.62
xcopa_sw,cause_effect,accuracy,0.64
xcopa_sw,i_am_hesitating,accuracy,0.66
xcopa_sw,plausible_alternatives,accuracy,0.64
xcopa_sw,median,accuracy,0.64
xcopa_ta,"C1 or C2? premise, so/because…",accuracy,0.59
xcopa_ta,best_option,accuracy,0.66
xcopa_ta,cause_effect,accuracy,0.7
xcopa_ta,i_am_hesitating,accuracy,0.69
xcopa_ta,plausible_alternatives,accuracy,0.64
xcopa_ta,median,accuracy,0.66
xcopa_vi,"C1 or C2? premise, so/because…",accuracy,0.58
xcopa_vi,best_option,accuracy,0.81
xcopa_vi,cause_effect,accuracy,0.91
xcopa_vi,i_am_hesitating,accuracy,0.85
xcopa_vi,plausible_alternatives,accuracy,0.84
xcopa_vi,median,accuracy,0.84
xcopa_zh,"C1 or C2? premise, so/because…",accuracy,0.57
xcopa_zh,best_option,accuracy,0.84
xcopa_zh,cause_effect,accuracy,0.86
xcopa_zh,i_am_hesitating,accuracy,0.86
xcopa_zh,plausible_alternatives,accuracy,0.81
xcopa_zh,median,accuracy,0.84
xnli_ar,GPT-3 style,accuracy,0.5578313253012048
xnli_ar,MNLI crowdsource,accuracy,0.41164658634538154
xnli_ar,can we infer,accuracy,0.5152610441767068
xnli_ar,guaranteed/possible/impossible,accuracy,0.5803212851405622
xnli_ar,justified in saying,accuracy,0.5184738955823294
xnli_ar,median,accuracy,0.5184738955823294
xnli_en,GPT-3 style,accuracy,0.6176706827309237
xnli_en,MNLI crowdsource,accuracy,0.4606425702811245
xnli_en,can we infer,accuracy,0.5714859437751004
xnli_en,guaranteed/possible/impossible,accuracy,0.6180722891566265
xnli_en,justified in saying,accuracy,0.5746987951807229
xnli_en,median,accuracy,0.5746987951807229
xnli_es,GPT-3 style,accuracy,0.5911646586345382
xnli_es,MNLI crowdsource,accuracy,0.43052208835341366
xnli_es,can we infer,accuracy,0.4397590361445783
xnli_es,guaranteed/possible/impossible,accuracy,0.5208835341365462
xnli_es,justified in saying,accuracy,0.41726907630522087
xnli_es,median,accuracy,0.4397590361445783
xnli_fr,GPT-3 style,accuracy,0.5911646586345382
xnli_fr,MNLI crowdsource,accuracy,0.4321285140562249
xnli_fr,can we infer,accuracy,0.5369477911646586
xnli_fr,guaranteed/possible/impossible,accuracy,0.5176706827309236
xnli_fr,justified in saying,accuracy,0.5385542168674698
xnli_fr,median,accuracy,0.5369477911646586
xnli_hi,GPT-3 style,accuracy,0.5208835341365462
xnli_hi,MNLI crowdsource,accuracy,0.3819277108433735
xnli_hi,can we infer,accuracy,0.44176706827309237
xnli_hi,guaranteed/possible/impossible,accuracy,0.5253012048192771
xnli_hi,justified in saying,accuracy,0.44377510040160645
xnli_hi,median,accuracy,0.44377510040160645
xnli_sw,GPT-3 style,accuracy,0.5036144578313253
xnli_sw,MNLI crowdsource,accuracy,0.3887550200803213
xnli_sw,can we infer,accuracy,0.44216867469879517
xnli_sw,guaranteed/possible/impossible,accuracy,0.38795180722891565
xnli_sw,justified in saying,accuracy,0.4397590361445783
xnli_sw,median,accuracy,0.4397590361445783
xnli_ur,GPT-3 style,accuracy,0.4907630522088353
xnli_ur,MNLI crowdsource,accuracy,0.37309236947791163
xnli_ur,can we infer,accuracy,0.45863453815261046
xnli_ur,guaranteed/possible/impossible,accuracy,0.5124497991967871
xnli_ur,justified in saying,accuracy,0.45582329317269077
xnli_ur,median,accuracy,0.45863453815261046
xnli_vi,GPT-3 style,accuracy,0.5582329317269076
xnli_vi,MNLI crowdsource,accuracy,0.42690763052208835
xnli_vi,can we infer,accuracy,0.4759036144578313
xnli_vi,guaranteed/possible/impossible,accuracy,0.5008032128514056
xnli_vi,justified in saying,accuracy,0.4827309236947791
xnli_vi,median,accuracy,0.4827309236947791
xnli_zh,GPT-3 style,accuracy,0.5550200803212851
xnli_zh,MNLI crowdsource,accuracy,0.4248995983935743
xnli_zh,can we infer,accuracy,0.43052208835341366
xnli_zh,guaranteed/possible/impossible,accuracy,0.5526104417670683
xnli_zh,justified in saying,accuracy,0.44016064257028115
xnli_zh,median,accuracy,0.44016064257028115
xstory_cloze_ar,Answer Given options,accuracy,0.7835870284579749
xstory_cloze_ar,Choose Story Ending,accuracy,0.9291859695565851
xstory_cloze_ar,Generate Ending,accuracy,0.6624751819986764
xstory_cloze_ar,Novel Correct Ending,accuracy,0.9252150893448048
xstory_cloze_ar,Story Continuation and Options,accuracy,0.9159497021839841
xstory_cloze_ar,median,accuracy,0.9159497021839841
xstory_cloze_es,Answer Given options,accuracy,0.870946393117141
xstory_cloze_es,Choose Story Ending,accuracy,0.9523494374586366
xstory_cloze_es,Generate Ending,accuracy,0.7319655857048313
xstory_cloze_es,Novel Correct Ending,accuracy,0.9477167438782264
xstory_cloze_es,Story Continuation and Options,accuracy,0.9516876240900066
xstory_cloze_es,median,accuracy,0.9477167438782264
xstory_cloze_eu,Answer Given options,accuracy,0.6982131039046989
xstory_cloze_eu,Choose Story Ending,accuracy,0.85704831237591
xstory_cloze_eu,Generate Ending,accuracy,0.614162806088683
xstory_cloze_eu,Novel Correct Ending,accuracy,0.8590337524818001
xstory_cloze_eu,Story Continuation and Options,accuracy,0.8504301786896096
xstory_cloze_eu,median,accuracy,0.8504301786896096
xstory_cloze_hi,Answer Given options,accuracy,0.7683653209794837
xstory_cloze_hi,Choose Story Ending,accuracy,0.8742554599602912
xstory_cloze_hi,Generate Ending,accuracy,0.657180675049636
xstory_cloze_hi,Novel Correct Ending,accuracy,0.886829913964262
xstory_cloze_hi,Story Continuation and Options,accuracy,0.8762409000661814
xstory_cloze_hi,median,accuracy,0.8742554599602912
xstory_cloze_id,Answer Given options,accuracy,0.8332230311052283
xstory_cloze_id,Choose Story Ending,accuracy,0.913964262078094
xstory_cloze_id,Generate Ending,accuracy,0.700198544010589
xstory_cloze_id,Novel Correct Ending,accuracy,0.9205823957643945
xstory_cloze_id,Story Continuation and Options,accuracy,0.9086697551290536
xstory_cloze_id,median,accuracy,0.9086697551290536
xstory_cloze_zh,Answer Given options,accuracy,0.870946393117141
xstory_cloze_zh,Choose Story Ending,accuracy,0.9265387160820648
xstory_cloze_zh,Generate Ending,accuracy,0.6823295830575777
xstory_cloze_zh,Novel Correct Ending,accuracy,0.928524156187955
xstory_cloze_zh,Story Continuation and Options,accuracy,0.9232296492389146
xstory_cloze_zh,median,accuracy,0.9232296492389146
xwinograd_en,Replace,accuracy,0.6933333333333334
xwinograd_en,True or False,accuracy,0.5212903225806451
xwinograd_en,does underscore refer to,accuracy,0.6563440860215054
xwinograd_en,stand for,accuracy,0.5156989247311828
xwinograd_en,underscore refer to,accuracy,0.6473118279569893
xwinograd_en,median,accuracy,0.6473118279569893
xwinograd_fr,Replace,accuracy,0.6024096385542169
xwinograd_fr,True or False,accuracy,0.46987951807228917
xwinograd_fr,does underscore refer to,accuracy,0.5903614457831325
xwinograd_fr,stand for,accuracy,0.4939759036144578
xwinograd_fr,underscore refer to,accuracy,0.6867469879518072
xwinograd_fr,median,accuracy,0.5903614457831325
xwinograd_pt,Replace,accuracy,0.6463878326996197
xwinograd_pt,True or False,accuracy,0.5285171102661597
xwinograd_pt,does underscore refer to,accuracy,0.6007604562737643
xwinograd_pt,stand for,accuracy,0.49809885931558934
xwinograd_pt,underscore refer to,accuracy,0.6083650190114068
xwinograd_pt,median,accuracy,0.6007604562737643
xwinograd_zh,Replace,accuracy,0.7063492063492064
xwinograd_zh,True or False,accuracy,0.5515873015873016
xwinograd_zh,does underscore refer to,accuracy,0.621031746031746
xwinograd_zh,stand for,accuracy,0.5158730158730159
xwinograd_zh,underscore refer to,accuracy,0.6765873015873016
xwinograd_zh,median,accuracy,0.621031746031746
multiple,average,multiple,0.6665267892901372