crumb committed
Commit c023524
1 Parent(s): 5c487bd

Update README.md

Files changed (1):
  1. README.md +6 -5

README.md CHANGED
@@ -9,7 +9,7 @@ language:
 
 *by GPT-4 & Crumb*
 
- ***Note***: *this version of the model was not trained with a constant-length dataset. It is in the process of being retrained right now.*
+ ***Note***: *this model is in the process of being re-evaluated because it was retrained.*
 
 ### Introduction
 
@@ -30,12 +30,13 @@ A learning rate of 1e-4 was used in this study, with no learning rate schedule.
 
 [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) suggests that a student around 40% of the size of its teacher can achieve similar performance in encoder models when training from scratch with supervision. We warm-start our model from a checkpoint smaller than the teacher that maintains a similar ratio, with a student that is 43.75% the size of its teacher.
 
- | model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc |
- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc | notes |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | pythia-70m (student base) | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
 | pythia-160m (teacher) | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
- | --- | --- | --- | --- | --- | --- | --- | --- |
- | distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** | trained on padded/truncated examples |
+ | distilpythia-cl (student) | 59.30 | 50.75 | 403.78 | 15.16 | 16.98 | 59.20 | **36.54** | trained on a constant-length dataset |
 
 <center> <i>Table 1.</i> The student before finetuning, teacher, and student after finetuning, and their results on various benchmarks. Numbers in bold are where the student after finetuning matches or outperforms the student before finetuning. </center>
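The two data formats named in the table's notes column differ in how examples reach the model: distilpythia saw documents individually padded or truncated to the context length, while distilpythia-cl saw fixed-length blocks packed from concatenated text. Below is a minimal sketch of constant-length packing, assuming the common Hugging Face group-texts recipe and a 2048-token context; the commit does not show the actual preprocessing or name the training corpus.

```python
# Sketch of "constant-length" packing: concatenate tokenized documents and
# split the stream into equal fixed-size blocks, instead of padding or
# truncating each document to the context length on its own.
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
block_size = 2048  # assumption: Pythia's context length

def group_texts(examples):
    # Flatten every tokenized document into one long stream of token ids.
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    # Drop the ragged tail so every block is exactly block_size tokens.
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}

# Hypothetical corpus, for illustration only.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(lambda x: tokenizer(x["text"]), batched=True,
                    remove_columns=raw.column_names)
packed = tokenized.map(group_texts, batched=True,
                       remove_columns=tokenized.column_names)
```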
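For the warm-started distillation itself, a hedged sketch of one training step follows: pythia-70m (43.75% of the teacher's parameter count) as the student, pythia-160m as the teacher, and the constant 1e-4 learning rate stated in the README. The soft-target KL-divergence loss, the temperature, and the choice of optimizer are assumptions; the commit does not specify them.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Warm start: the student is initialized from a pretrained checkpoint
# (pythia-70m, 43.75% the size of the teacher) rather than from scratch.
student = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
teacher = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
teacher.eval()

# Constant 1e-4 learning rate per the README; Adam is an assumption.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
temperature = 2.0  # assumption: not specified in this commit

def distill_step(input_ids):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    # Soft-target loss: KL divergence between temperature-scaled
    # teacher and student token distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```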
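The benchmark names in Table 1 line up with task names in EleutherAI's lm-evaluation-harness. Assuming that harness produced these numbers (the commit does not say which evaluation code was used), a re-run would look roughly like the sketch below; the repository id and the choice of arc_easy for the "arc acc" column are guesses.

```python
# Sketch: re-running the Table 1 benchmarks with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The task list and the
# model repository id are assumptions, not taken from this commit.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=crumb/distilpythia",  # hypothetical repo id
    tasks=["piqa", "winogrande", "lambada_openai", "arc_easy", "sciq", "wsc"],
)
print(results["results"])
```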