Tristan committed
Commit a343b43
1 Parent(s): 792a71b

Update README.md

Files changed (1): README.md (+22, -53)
README.md CHANGED
@@ -65,59 +65,28 @@ The model was trained according to the OLM GPT2 instructions at this [repo](http
 
  The model achieves the following results without any fine-tuning (zero-shot):
 
- | Task |Version| Metric |Value | |Stderr|
- |--------------|------:|----------------|-----:|---|------|
- |webqs | 0|acc_p_value |0.0000| | |
- |triviaqa | 1|acc_p_value |0.0088| | |
- |arc_easy | 0|acc_p_value |0.0022| | |
- | | |acc_norm_p_value|0.0049| | |
- |arc_challenge | 0|acc_p_value |0.1017| | |
- | | |acc_norm_p_value|0.2957| | |
- |copa | 0|acc_p_value |0.4070| | |
- |qnli | 0|acc_p_value |0.2913| | |
- |lambada_openai| 0|ppl_p_value |0.0000| | |
- | | |acc_p_value |0.0000| | |
- |mrpc | 0|acc_p_value |0.0000| | |
- | | |f1_p_value |0.0000| | |
- |wsc | 0|acc_p_value |0.1680| | |
- |winogrande | 0|acc_p_value |0.4314| | |
- |hellaswag | 0|acc_p_value |0.0000| | |
- | | |acc_norm_p_value|0.0000| | |
- |rte | 0|acc_p_value |0.7184| | |
- |mnli | 0|acc_p_value |0.0071| | |
- |multirc | 1|acc_p_value |0.4755| | |
- |cb | 1|acc_p_value |0.2816| | |
- |boolq | 1|acc_p_value |0.0000| | |
- |wic | 0|acc_p_value |0.6924| | |
- |piqa | 0|acc_p_value |0.0004| | |
- | | |acc_norm_p_value|0.0003| | |
- |cola | 0|mcc_p_value |0.6880| | |
- |record | 0|f1_p_value |0.0000| | |
- | | |em_p_value |0.0000| | |
-
-
- | Task | Metric | Original GPT2 | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
- |:------------|:-----------|--------------------:|-------------------------:|----------------------------------:|
- |rte |acc |0.5307 |0.5199 |0.7184 |
- |piqa |acc/acc_norm|0.6289/0.6251 |**0.6692**/**0.6665** |**0.0004**/**0.0003** |
- |copa |acc |0.6400 |0.6800 |0.4070 |
- |record |f1/em |**0.7094**/**0.7026**|0.6884/0.6818 |**0.0000**/**0.0000** |
- |boolq |acc |0.4872 |**0.6021** |**0.0000** |
- |cb |acc/f1 |0.4107/0.2619 |0.3393/0.1840 |0.2816/NA |
- |hellaswag |acc/acc_norm|0.2892/0.3114 |**0.3079**/**0.3482** |**0.0000**/**0.0000** |
- |mrpc |acc/f1 |0.5662/0.6911 |**0.6814**/**0.8099** |**0.0000**/**0.0000** |
- |multirc |acc |0.0189 |0.0220 |0.4755 |
- |lambada |ppl/acc |40.0554/0.3256 |**28.3359**/**0.3699** |**0.0000**/**0.0000** |
- |wsc |acc |0.4327 |0.3654 |0.1680 |
- |wic |acc |0.4922 |0.5000 |0.6924 |
- |mnli |acc |0.3372 |**0.3501** |**0.0071** |
- |qnli |acc |0.5017 |0.4946 |0.2913 |
- |cola |mcc |0.0126 |0.0000 |0.6880 |
- |triviaqa |acc |0.0151 |**0.0181** |**0.0088** |
- |winogrande |acc |0.5162 |0.5051 |0.4314 |
- |webqs |acc |0.0030 |**0.0079** |**0.0000** |
- |arc_easy |acc/acc_norm|0.4381/0.3948 |**0.4693**/**0.4230** |**0.0022**/**0.0049** |
- |arc_challenge|acc/acc_norm|0.1903/0.2270 |0.2090/0.2398 |0.1017/0.2957 |
+ | Task | Version | Metric | Original GPT2 | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
+ |:------------|:--------|:-----------|--------------------:|-------------------------:|----------------------------------:|
+ |rte |0 |acc |0.5307 |0.5199 |0.7184 |
+ |piqa |0 |acc/acc_norm|0.6289/0.6251 |**0.6692**/**0.6665** |**0.0004**/**0.0003** |
+ |copa |0 |acc |0.6400 |0.6800 |0.4070 |
+ |record |0 |f1/em |**0.7094**/**0.7026**|0.6884/0.6818 |**0.0000**/**0.0000** |
+ |boolq |1 |acc |0.4872 |**0.6021** |**0.0000** |
+ |cb |1 |acc/f1 |0.4107/0.2619 |0.3393/0.1840 |0.2816/NA |
+ |hellaswag |0 |acc/acc_norm|0.2892/0.3114 |**0.3079**/**0.3482** |**0.0000**/**0.0000** |
+ |mrpc |0 |acc/f1 |0.5662/0.6911 |**0.6814**/**0.8099** |**0.0000**/**0.0000** |
+ |multirc |1 |acc |0.0189 |0.0220 |0.4755 |
+ |lambada |0 |ppl/acc |40.0554/0.3256 |**28.3359**/**0.3699** |**0.0000**/**0.0000** |
+ |wsc |0 |acc |0.4327 |0.3654 |0.1680 |
+ |wic |0 |acc |0.4922 |0.5000 |0.6924 |
+ |mnli |0 |acc |0.3372 |**0.3501** |**0.0071** |
+ |qnli |0 |acc |0.5017 |0.4946 |0.2913 |
+ |cola |0 |mcc |0.0126 |0.0000 |0.6880 |
+ |triviaqa |1 |acc |0.0151 |**0.0181** |**0.0088** |
+ |winogrande |0 |acc |0.5162 |0.5051 |0.4314 |
+ |webqs |0 |acc |0.0030 |**0.0079** |**0.0000** |
+ |arc_easy |0 |acc/acc_norm|0.4381/0.3948 |**0.4693**/**0.4230** |**0.0022**/**0.0049** |
+ |arc_challenge|0 |acc/acc_norm|0.1903/0.2270 |0.2090/0.2398 |0.1017/0.2957 |
 
  To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness),
  which can produce results different than those reported in the GPT2 paper. The p-values come from the stderr from the evaluation harness, plus a normal distribution assumption.
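The significance column above can be sketched from that description: a two-sample z-test on the difference of two metric estimates, using the standard errors the harness reports and assuming the two estimates are independent and normally distributed. This is a minimal illustration, not the exact script used for the table; the function name and the example numbers plugged in below are illustrative.

```python
from math import erf, sqrt

def two_tailed_p(mean_a: float, se_a: float, mean_b: float, se_b: float) -> float:
    """Two-tailed p-value for the difference between two independent,
    approximately normal estimates (e.g. two models' accuracies with
    their stderrs from the evaluation harness)."""
    # z-score of the difference; the stderr of (a - b) is sqrt(se_a^2 + se_b^2)
    # under the independence assumption.
    z = (mean_a - mean_b) / sqrt(se_a**2 + se_b**2)
    # Standard normal CDF via erf, doubled for a two-tailed test.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

For example, comparing two accuracies like `two_tailed_p(0.6021, se_a, 0.4872, se_b)` with the stderrs the harness prints would yield a p-value near zero, matching the boolq row, while identical means give a p-value of 1.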