distily_bench_obj_cross_v2.13_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 1376.0
  • eval_frwikippl: 5824.0
  • eval_zhwikippl: 101888.0
  • eval_tinystoriesppl: 972.0
  • eval_loss: 3.1064
  • eval_runtime: 12.9246 (s)
  • eval_samples_per_second: 46.423
  • eval_steps_per_second: 11.606
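
The ppl metrics are perplexities on the respective evaluation corpora (English, French, and Chinese Wikipedia, plus TinyStories). As a rough illustration only, not Distily's actual evaluation harness, causal-LM perplexity is the exponential of the mean token cross-entropy; the example text below is arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.13_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Distillation trains a small student to mimic a larger teacher."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the shifted cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```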

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=cos, layer_mapper=uniform_cons, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)) (see the loss sketch after this list)
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.5
  • num_epochs: 1.0
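
The distillation objective above combines a KL-divergence loss on the logits (weight 1) with a cosine loss on hidden states (weight 1.0, uniformly mapped layers); the attention component has weight 0 and is effectively disabled. Below is a minimal sketch of such a combined loss, not Distily's actual implementation: `distillation_loss` is a hypothetical helper, and it assumes the student and teacher outputs expose `logits` and `hidden_states` with matching layer counts.

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=1.0):
    # Logits component: KL divergence between student and teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Hidden-state component: mean cosine distance across layers. The identity
    # mapping here stands in for the card's uniform_cons layer mapper, since this
    # student matches the teacher's depth.
    hs = sum(
        (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ) / len(student_out.hidden_states)
    # Attention component (weight 0) is omitted.
    return logits_weight * kl + hs_weight * hs
```

The optimizer and schedule settings map onto standard PyTorch/transformers calls; a sketch, using the final step count from the eval table below and a plain gpt2 checkpoint as a stand-in for the actual student initialization:

```python
from torch.optim import Adam
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

# Stand-in for the actual student initialization (same 124M GPT-2 architecture).
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_training_steps = 7425  # final step in the eval table below
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.5 * num_training_steps),  # lr_scheduler_warmup_ratio: 0.5
    num_training_steps=num_training_steps,
)
```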

Resource Usage

Peak GPU Memory: 8.0905 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
|---|---|---|---|---|---|---|---|---|---|
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 1821066133504.0 | 158329674399744.0 | 20.2008 | 12.884 | 46.569 | 11.642 | 12079595520.0 | 98956046499840.0 |
| 750 | 0.1010 | 1376.0 | 5824.0 | 3.1064 | 12.9246 | 46.423 | 11.606 | 972.0 | 101888.0 |
| 1500 | 0.2020 | 584.0 | 3552.0 | 2.2186 | 12.958 | 46.304 | 11.576 | 438.0 | 1012.0 |
| 2250 | 0.3030 | 376.0 | 1888.0 | 1.9252 | 12.9544 | 46.316 | 11.579 | 312.0 | 366.0 |
| 3000 | 0.4040 | 266.0 | 1072.0 | 1.6667 | 13.0268 | 46.059 | 11.515 | 227.0 | 203.0 |
| 3750 | 0.5051 | 211.0 | 736.0 | 1.4766 | 12.9492 | 46.335 | 11.584 | 175.0 | 233.0 |
| 4500 | 0.6061 | 171.0 | 588.0 | 1.2986 | 13.0596 | 45.943 | 11.486 | 141.0 | 147.0 |
| 5250 | 0.7071 | 134.0 | 480.0 | 1.1348 | 12.9613 | 46.292 | 11.573 | 110.5 | 154.0 |
| 6000 | 0.8081 | 125.0 | 456.0 | 1.0662 | 12.9543 | 46.317 | 11.579 | 100.0 | 129.0 |
| 6750 | 0.9091 | 119.0 | 440.0 | 1.0325 | 12.9319 | 46.397 | 11.599 | 96.0 | 122.5 |
| 7425 | 1.0 | 118.0 | 436.0 | 1.0267 | 12.9764 | 46.238 | 11.559 | 94.5 | 122.0 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0
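
For completeness, a minimal loading-and-generation sketch using the standard transformers API (the model id is this repository's; the prompt and generation settings are arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_obj_cross_v2.13_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```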