distily_bench_gpt2_optim_extended2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9500 row of the Eval-Phase Metrics table, not the final step):

eval_enwikippl: 1466.9598
eval_frwikippl: 6589.9976
eval_zhwikippl: 19049.6328
eval_loss: 8530.3359
eval_runtime: 64.7254
eval_samples_per_second: 46.35
eval_steps_per_second: 11.587
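
The throughput figures above can be cross-checked against each other. The following is a minimal sketch, assuming each rate is simply a count divided by eval_runtime (the card does not state how they were computed); the variable names prefixed with approx_ are illustrative only.

```python
# Values copied from the evaluation results above.
eval_runtime = 64.7254            # seconds
eval_samples_per_second = 46.35
eval_steps_per_second = 11.587

# Assumption: each rate equals count / runtime, so count = rate * runtime.
approx_samples = eval_samples_per_second * eval_runtime
approx_steps = eval_steps_per_second * eval_runtime

# Samples per step should land near the configured eval_batch_size.
approx_batch_size = approx_samples / approx_steps

print(round(approx_samples), round(approx_steps), round(approx_batch_size))
```

Under that assumption, the figures imply an evaluation set of about 3000 samples processed in about 750 steps, which is consistent with the eval_batch_size of 4 listed under the training hyperparameters.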

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: 'legacy'
loss_fn: kl
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 8
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
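
To make the loss_fn: kl and total_train_batch_size entries concrete, here is a minimal pure-Python sketch of a KL divergence between teacher and student token distributions, plus the batch-size arithmetic. It is not Distily's implementation: the direction KL(teacher || student), the absence of a temperature, and the example logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 4-token vocabulary, for illustration only.
teacher = [2.0, 1.0, 0.1, -1.0]
student = [1.5, 1.2, 0.3, -0.5]
loss = kl_divergence(teacher, student)

# Effective batch size from the hyperparameters listed above.
train_batch_size = 8
gradient_accumulation_steps = 2
total_train_batch_size = train_batch_size * gradient_accumulation_steps

print(f"KL loss: {loss:.4f}, total_train_batch_size: {total_train_batch_size}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is why it can serve as a distillation signal. The product of train_batch_size and gradient_accumulation_steps reproduces the listed total_train_batch_size of 16.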

Resource Usage

Peak GPU Memory: 8.3354 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	zhwikippl
teacher eval		30.2385	57.2728					18.1772
0	0	55332.9297	57511.9648	333834.9375	64.4894	46.519	11.63	57797.4375
500	0.0269	3397.8057	14195.7314	11200.1709	64.3161	46.645	11.661	46176.3906
1000	0.0539	2565.4185	11100.7803	10401.7070	64.9732	46.173	11.543	40786.25
1500	0.0808	2280.1555	9752.9180	10029.2695	65.1147	46.073	11.518	34300.0664
2000	0.1077	2111.7202	8617.1777	9861.6855	65.0861	46.093	11.523	27128.5918
2500	0.1347	1990.7386	8209.1553	9601.2373	64.8934	46.23	11.557	25209.2168
3000	0.1616	1918.3867	7799.5220	9467.9785	64.886	46.235	11.559	22736.8027
3500	0.1886	1818.1265	7551.1548	9349.7920	64.7154	46.357	11.589	22582.4883
4000	0.2155	1769.4467	7458.5562	9246.7197	64.7466	46.334	11.584	21114.0508
4500	0.2424	1728.6010	7363.9741	9099.1787	65.1202	46.069	11.517	20729.8926
5000	0.2694	1704.3433	7453.2944	9068.9062	64.69	46.375	11.594	21740.6367
5500	0.2963	1664.6129	7184.9824	8969.5039	64.2668	46.68	11.67	20534.2910
6000	0.3232	1631.8164	7198.6724	8898.6348	65.558	45.761	11.44	22204.2188
6500	0.3502	1589.2347	6884.9448	8812.0322	64.8035	46.294	11.573	19131.2129
7000	0.3771	1553.9370	6727.0781	8747.2002	65.3644	45.897	11.474	18709.2949
7500	0.4040	1540.8395	6779.4512	8707.7334	64.9958	46.157	11.539	18515.4297
8000	0.4310	1519.5702	6720.9155	8684.7471	65.1941	46.016	11.504	19323.7656
8500	0.4579	1499.4967	6702.9292	8618.3145	64.6164	46.428	11.607	20303.8691
9000	0.4848	1468.8694	6597.9023	8579.7764	65.1809	46.026	11.506	19187.4902
9500	0.5118	1466.9598	6589.9976	8530.3359	64.7254	46.35	11.587	19049.6328
10000	0.5387	1450.3381	6594.1782	8527.4131	65.1904	46.019	11.505	20619.4590
10500	0.5657	1422.2881	6539.0815	8491.7549	64.9945	46.158	11.539	20106.9180
11000	0.5926	1413.1234	6447.0659	8481.6855	65.107	46.078	11.52	18302.7910
11500	0.6195	1399.7990	6463.4536	8433.2803	64.732	46.345	11.586	18501.8398
12000	0.6465	1386.2769	6439.3423	8387.9043	64.7399	46.339	11.585	18306.4570
12500	0.6734	1381.0126	6380.1401	8346.6777	64.7944	46.3	11.575	19072.5371
13000	0.7003	1360.2582	6364.1938	8351.8828	64.608	46.434	11.608	18941.8262
13500	0.7273	1355.2496	6337.5508	8364.6289	64.4743	46.53	11.633	18354.1797
14000	0.7542	1342.7577	6132.9243	8351.3281	64.4281	46.564	11.641	18108.3027
14500	0.7811	1324.4287	6172.4019	8299.2109	64.0768	46.819	11.705	17864.5078
15000	0.8081	1311.8136	6250.3555	8288.9170	63.9884	46.883	11.721	18093.8008
15500	0.8350	1300.1758	6161.9678	8240.8105	65.0003	46.154	11.538	18435.2441
16000	0.8620	1294.5092	6087.9023	8225.1836	65.3075	45.937	11.484	18195.5664
16500	0.8889	1272.7550	6124.9282	8187.4561	64.7644	46.322	11.58	18905.1719
17000	0.9158	1271.9396	6117.1646	8179.8828	66.1093	45.379	11.345	17912.2910
17500	0.9428	1263.8173	5966.3726	8165.7280	64.1579	46.76	11.69	16779.9922
18000	0.9697	1245.9607	6065.6255	8219.2422	64.3092	46.65	11.662	17666.4180
18500	0.9966	1240.7706	6013.2476	8146.3145	64.5002	46.511	11.628	16597.2520
18562	1.0000	1242.8444	5899.8604	8136.0962	64.3726	46.604	11.651	16160.9238
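
One way to read the table is to compare the student's perplexities with the teacher eval row. The sketch below copies three rows from the table (teacher, step 0, and the final step 18562) and computes the ratio of student to teacher perplexity; the dictionary and function names are illustrative and not part of Distily.

```python
# Perplexities copied from the table above.
teacher = {"enwikippl": 30.2385, "frwikippl": 57.2728, "zhwikippl": 18.1772}
student_start = {"enwikippl": 55332.9297, "frwikippl": 57511.9648, "zhwikippl": 57797.4375}
student_final = {"enwikippl": 1242.8444, "frwikippl": 5899.8604, "zhwikippl": 16160.9238}

def ratio_to_teacher(student, reference):
    """Return student perplexity divided by teacher perplexity, per metric."""
    return {name: student[name] / reference[name] for name in reference}

start_ratio = ratio_to_teacher(student_start, teacher)
final_ratio = ratio_to_teacher(student_final, teacher)

for name in teacher:
    print(f"{name}: {start_ratio[name]:.1f}x -> {final_ratio[name]:.1f}x teacher perplexity")
```

After one epoch the student's perplexity is still roughly 41 times the teacher's on enwiki, 103 times on frwiki, and 889 times on zhwiki, so the gap narrows most on English text and least on Chinese text.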

Framework versions

Distily 0.2.0
Transformers 4.44.0
Pytorch 2.3.0
Datasets 2.20.0

lapp0/distily_bench_gpt2_optim_extended2