crumb committed on
Commit
57ed734
1 Parent(s): bfe8bf1

Update README.md

  1. README.md +5 -4
README.md CHANGED
@@ -30,12 +30,13 @@ A learning rate of 1e-4 was used in this study, with no learning rate schedule.
 
  [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) suggest that a student around 40% of the size of its teacher can achieve similar performance in encoder models when training from scratch with supervision. We warm-start our model from a smaller checkpoint than the teacher that maintains a similar ratio, with a student that is 43.75% the size of its teacher.
 
- | model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc |
- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc | notes |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | pythia-70m (student base) | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
  | pythia-160m (teacher) | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** | trained on padded/truncated examples |
+ | distilpythia-cl (student) | 59.30 | 50.75 | 403.78 | 15.16 | 16.98 | 59.20 | **36.54** | trained on a constant-length dataset |
 
  <center> <i>Table 1.</i> Benchmark results for the student before finetuning, the teacher, and the student after finetuning. Numbers in bold are where the student after finetuning matches or outperforms the student before finetuning. </center>
 
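Review note: the bolding rule in the Table 1 caption and the 43.75% size figure can both be sanity-checked with a few lines of Python. This is only a sketch, not part of the commit. The scores are copied from the table in this diff. The helper names `bold_columns` and `size_ratio` are hypothetical, and two things are assumed rather than stated in the README: that `lambada ppl` is the only lower-is-better column, and that the nominal parameter counts are 70M and 160M, as the model names suggest.

```python
# Scores copied from Table 1 in this diff, in column order.
METRICS = ["piqa acc", "winogrande acc", "lambada ppl", "lambada acc",
           "arc acc", "sciq acc", "wsc acc"]

# Assumption: perplexity is the only column where lower is better.
LOWER_IS_BETTER = {"lambada ppl"}

BASE = [59.85, 51.22, 140.81, 21.40, 17.15, 65.00, 36.53]  # pythia-70m
STUDENTS = {
    "distilpythia": [59.74, 51.62, 420.70, 15.82, 17.15, 61.30, 36.54],
    "distilpythia-cl": [59.30, 50.75, 403.78, 15.16, 16.98, 59.20, 36.54],
}


def bold_columns(student, base=BASE):
    """Return the metrics where the student matches or outperforms the base."""
    selected = []
    for name, s, b in zip(METRICS, student, base):
        matches_or_beats = s <= b if name in LOWER_IS_BETTER else s >= b
        if matches_or_beats:
            selected.append(name)
    return selected


def size_ratio(student_params_m, teacher_params_m):
    """Student size as a fraction of teacher size (both in millions)."""
    return student_params_m / teacher_params_m


if __name__ == "__main__":
    for model, scores in STUDENTS.items():
        print(model, "->", bold_columns(scores))
    # Assumption: nominal sizes taken from the model names (70m, 160m).
    print(f"size ratio: {size_ratio(70, 160):.2%}")
```

Under these assumptions the check agrees with the bolding in both added rows: three bold cells for `distilpythia` and only `wsc acc` for `distilpythia-cl`.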