distily_bench_gpt2_optim

This student model was distilled from the teacher model gpt2 using the Distily library; the training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (a rough perplexity-evaluation sketch follows the list):

  • eval_enwikippl: 557.4904
  • eval_frwikippl: 3720.3530
  • eval_zhwikippl: 4680.3271
  • eval_loss: 7067.3281
  • eval_runtime: 21.809
  • eval_samples_per_second: 45.853
  • eval_steps_per_second: 11.463
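The enwikippl, frwikippl, and zhwikippl metrics are perplexities on English, French, and Chinese Wikipedia text respectively. As a rough illustration only (not the exact Distily evaluation code, and the sequence length is an assumption), perplexity for this causal LM can be computed as follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled student model (GPT-2 architecture).
model_id = "lapp0/distily_bench_gpt2_optim"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def perplexity(text: str, max_length: int = 1024) -> float:
    """Token-level perplexity of `text` under the model (illustrative sketch)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```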

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a minimal distillation-loss sketch follows the list):

  • distillation_objective: 'legacy'
  • loss_fn: kl
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
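The exact Distily training loop is not reproduced here. The snippet below is a minimal sketch of a forward-KL distillation objective using the optimizer settings listed above; the teacher/student loading and batch handling are assumptions, not Distily's API.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("gpt2").eval()
# Assumption: the student shares the GPT-2 architecture; all of its parameters,
# including the embeddings (train_embeddings: True), are optimized.
student = AutoModelForCausalLM.from_pretrained("gpt2")

# Hyperparameters from the list above; lr_scheduler_type is constant, so no scheduler step.
optimizer = torch.optim.Adam(student.parameters(), lr=4e-5, betas=(0.9, 0.999), eps=1e-8)
grad_accum_steps = 4  # per-device batch size 4 * accumulation 4 = total batch size 16

def kl_distill_loss(student_logits, teacher_logits):
    """KL divergence between teacher and student next-token distributions."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_prob = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean")

def training_step(batch, step):
    """One micro-batch step with gradient accumulation (batch holds input_ids / attention_mask)."""
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kl_distill_loss(s_logits, t_logits) / grad_accum_steps
    loss.backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
```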

Resource Usage

Peak GPU Memory: 4.9635 GB
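A peak figure like this can be read from PyTorch's CUDA memory statistics. This is a generic sketch, not necessarily how Distily reports the number:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run training or evaluation ...
peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```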

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 331776.0 | 21.8237 | 45.822 | 11.455 | 57080.2930 |
| 500 | 0.0808 | 2141.9436 | 10202.3701 | 12330.6885 | 21.7635 | 45.948 | 11.487 | 53877.0430 |
| 1000 | 0.1616 | 1548.9375 | 5687.2700 | 10414.7197 | 21.7598 | 45.956 | 11.489 | 18884.9902 |
| 1500 | 0.2424 | 1210.3187 | 5229.9336 | 9469.8877 | 21.7665 | 45.942 | 11.486 | 13649.9746 |
| 2000 | 0.3232 | 1043.7686 | 5214.2856 | 8923.7764 | 21.649 | 46.191 | 11.548 | 16504.4258 |
| 2500 | 0.4040 | 918.9057 | 4731.1772 | 8583.6797 | 21.7074 | 46.067 | 11.517 | 16631.6348 |
| 3000 | 0.4848 | 835.4744 | 4334.6509 | 8223.8076 | 21.8184 | 45.833 | 11.458 | 11922.9453 |
| 3500 | 0.5657 | 767.9663 | 4349.3467 | 8085.1519 | 21.829 | 45.811 | 11.453 | 14098.2949 |
| 4000 | 0.6465 | 713.7677 | 4238.2466 | 7742.4639 | 21.8774 | 45.709 | 11.427 | 15250.9297 |
| 4500 | 0.7273 | 665.8071 | 3945.1226 | 7548.7358 | 21.7762 | 45.922 | 11.48 | 11118.6543 |
| 5000 | 0.8081 | 625.8375 | 3838.6619 | 7384.2559 | 21.827 | 45.815 | 11.454 | 7372.2939 |
| 5500 | 0.8889 | 599.3104 | 3789.3154 | 7218.8159 | 21.7296 | 46.02 | 11.505 | 5835.5698 |
| 6000 | 0.9697 | 571.8722 | 3735.3345 | 7106.2402 | 21.667 | 46.153 | 11.538 | 5088.3940 |
| 6187 | 0.9999 | 557.4904 | 3720.3530 | 7067.3281 | 21.809 | 45.853 | 11.463 | 4680.3271 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0