distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is not specified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

eval_enwikippl: 84.0
eval_frwikippl: 336.0
eval_zhwikippl: 143.0
eval_tinystoriesppl: 68.0
eval_loss: 0.6821
eval_runtime: 16.9876
eval_samples_per_second: 58.866
eval_steps_per_second: 7.358
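
The runtime and throughput figures above imply the approximate size of the evaluation run. The following is a minimal sketch of that arithmetic, assuming the throughput values are averages over the whole evaluation pass; the variable names are illustrative and not part of Distily.

```python
# Values copied from the evaluation results above.
eval_runtime = 16.9876             # seconds
eval_samples_per_second = 58.866
eval_steps_per_second = 7.358

# Throughput multiplied by runtime recovers the approximate totals.
n_samples = eval_runtime * eval_samples_per_second
n_steps = eval_runtime * eval_steps_per_second

print(round(n_samples))            # 1000 evaluation samples
print(round(n_steps))              # 125 evaluation steps
print(round(n_samples / n_steps))  # 8 samples per step
```

The recovered value of 8 samples per step agrees with the eval_batch_size listed under the training hyperparameters.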

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(
    logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None),
    hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None),
    attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)
)
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.2
num_epochs: 1.0
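
The distillation_objective above combines a KL term on the logits (weight 1) with an MSE term on the last hidden state (weight 1.0), and gives the attention term a weight of 0. The sketch below illustrates how such a weighted sum could be computed for a single token position. It is a dependency-free illustration, not Distily's implementation: the function names, the KL direction (teacher relative to student), and the toy inputs are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's vocabulary distribution.
    The direction is an assumption made for this sketch."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_hidden, student_hidden):
    """Mean squared error between two equally sized hidden-state vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)

def distillation_loss(t_logits, s_logits, t_last_hs, s_last_hs,
                      logits_weight=1.0, hs_weight=1.0):
    """Weighted sum of the logits and last-hidden-state terms.
    The attention term is omitted because its weight is 0 in this run."""
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + hs_weight * mse(t_last_hs, s_last_hs))

# Toy inputs (hypothetical values, for illustration only).
loss = distillation_loss(
    t_logits=[2.0, 1.0, 0.1], s_logits=[1.5, 1.2, 0.3],
    t_last_hs=[0.5, -0.2], s_last_hs=[0.4, 0.1],
)
print(loss)
```

In a real training loop both terms would be averaged over every token position in the batch. The per-token form is used here only to keep the example short.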

Resource Usage

Peak GPU Memory: 7.7252 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	2473901162496.0	170424302305280.0	20.7680	16.9731	58.917	7.365	4060086272.0	71468255805440.0
1000	0.0404	328.0	1488.0	1.5338	17.0565	58.629	7.329	243.0	608.0
2000	0.0808	229.0	792.0	1.3211	16.9911	58.854	7.357	190.0	260.0
3000	0.1212	178.0	624.0	1.1600	17.0018	58.817	7.352	149.0	176.0
4000	0.1616	147.0	580.0	1.0397	17.0215	58.749	7.344	119.0	161.0
5000	0.2020	128.0	516.0	0.9532	17.0153	58.771	7.346	102.0	159.0
6000	0.2424	111.0	410.0	0.8655	17.0046	58.808	7.351	90.0	147.0
7000	0.2828	104.5	410.0	0.8083	16.9742	58.913	7.364	82.0	145.0
8000	0.3232	97.5	382.0	0.7412	16.9735	58.915	7.364	74.0	128.0
9000	0.3636	84.0	336.0	0.6821	16.9876	58.866	7.358	68.0	143.0
10000	0.4040	77.5	312.0	0.6396	16.9771	58.903	7.363	65.0	140.0
11000	0.4444	75.5	280.0	0.5964	17.02	58.754	7.344	60.75	122.5
12000	0.4848	74.5	268.0	0.5797	16.9985	58.829	7.354	58.0	152.0
13000	0.5253	71.5	274.0	0.5537	16.9566	58.974	7.372	58.25	134.0
14000	0.5657	72.0	252.0	0.5429	16.9325	59.058	7.382	58.0	99.0
15000	0.6061	69.0	229.0	0.5308	16.9917	58.852	7.357	51.25	94.0
16000	0.6465	67.0	223.0	0.5209	16.9686	58.932	7.367	52.5	108.0
17000	0.6869	67.5	227.0	0.5046	16.979	58.896	7.362	54.25	118.0
18000	0.7273	67.5	244.0	0.5024	16.994	58.844	7.356	50.5	128.0
19000	0.7677	66.0	212.0	0.4931	16.9719	58.921	7.365	49.25	88.0
20000	0.8081	64.5	202.0	0.4925	17.0171	58.764	7.346	49.75	169.0
21000	0.8485	67.0	222.0	0.4839	16.9754	58.909	7.364	47.75	126.0
22000	0.8889	66.0	227.0	0.4759	16.9314	59.062	7.383	48.0	100.0
23000	0.9293	61.75	208.0	0.4704	16.9662	58.941	7.368	47.25	125.5
24000	0.9697	66.0	210.0	0.4706	17.0394	58.688	7.336	47.5	173.0
24750	1.0	63.75	218.0	0.4686	16.9798	58.894	7.362	46.75	82.5
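
One way to read the table above is to compare the final student row (step 24750) with the teacher row. The sketch below computes the student-to-teacher perplexity ratio per dataset and checks that the epoch column equals the step divided by the total step count. The numbers are copied from the table; the variable names are illustrative.

```python
# Perplexities copied from the table above.
teacher = {"enwikippl": 43.75, "frwikippl": 61.75,
           "tinystoriesppl": 11.8125, "zhwikippl": 19.125}
student = {"enwikippl": 63.75, "frwikippl": 218.0,
           "tinystoriesppl": 46.75, "zhwikippl": 82.5}  # step 24750

# A ratio of 1.0 would mean the student matches the teacher.
ratios = {name: student[name] / teacher[name] for name in teacher}
for name, ratio in ratios.items():
    print(f"{name}: student perplexity is {ratio:.2f}x the teacher's")

# The epoch column is the step divided by the total number of steps.
total_steps = 24750
print(round(1000 / total_steps, 4))  # 0.0404, as in the step 1000 row
```

By this measure the student ends closest to the teacher on enwikippl and furthest on zhwikippl.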

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0
