---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_attn
  results: []
---

# distily_bench_gpt2_attn

This student model was distilled from the teacher model gpt2 on an unspecified dataset, using the Distily library.

It achieves the following results on the evaluation set:

- eval_enwikippl: 207.4599
- eval_frwikippl: 1342.3768
- eval_zhwikippl: 657.2436
- eval_loss: 1.3331
- eval_runtime: 17.3049
- eval_samples_per_second: 57.787
- eval_steps_per_second: 7.223
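The `*ppl` metrics are perplexities on the respective wiki corpora. Perplexity is conventionally the exponential of the mean per-token cross-entropy; note that `eval_loss` here is the distillation objective, not a cross-entropy, so it does not exponentiate to the perplexity values above. A minimal sketch of the conventional relationship (illustrative, not Distily's evaluation code):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean negative log-likelihood (in nats per token)."""
    return math.exp(mean_nll)

# exp and log are inverses, so a corpus whose mean NLL is
# log(207.4599) has perplexity ~207.4599
ppl = perplexity(math.log(207.4599))
```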

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=2.0, loss_fn=kl, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
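The objective above combines a KL loss on the logits (weight 1) with a KL loss on the attention distributions (weight 2.0), while the hidden-state component is disabled (weight 0). A minimal pure-Python sketch of such a weighted objective; the helper names and the treatment of logits/attention as single flat distributions are illustrative assumptions, not Distily's actual API:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits,
                      teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=2.0):
    """Weighted sum mirroring the objective above:
    1.0 * KL on logits + 2.0 * KL on attention distributions."""
    l_logits = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    l_attn = kl_divergence(softmax(teacher_attn), softmax(student_attn))
    return logits_weight * l_logits + attn_weight * l_attn
```

When student and teacher agree exactly, both KL terms are zero; any disagreement in the attention maps is penalized twice as heavily as the same disagreement in the logits.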

## Resource Usage

Peak GPU Memory: 8.2195 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 6.1743 | 17.396 | 57.484 | 7.186 | 56988.9141 |
| 1000 | 0.0808 | 682.2277 | 4513.8320 | 2.0387 | 17.3415 | 57.665 | 7.208 | 21742.0879 |
| 2000 | 0.1616 | 493.2831 | 3192.1013 | 1.8548 | 17.3304 | 57.702 | 7.213 | 1917.9761 |
| 3000 | 0.2424 | 408.6152 | 2650.7046 | 1.7483 | 17.3279 | 57.71 | 7.214 | 937.2945 |
| 4000 | 0.3232 | 362.1653 | 2422.9863 | 1.6582 | 17.3944 | 57.49 | 7.186 | 807.3055 |
| 5000 | 0.4040 | 311.0092 | 2075.7251 | 1.5707 | 17.3884 | 57.51 | 7.189 | 967.0451 |
| 6000 | 0.4848 | 271.9341 | 1744.2100 | 1.4998 | 17.372 | 57.564 | 7.195 | 798.9407 |
| 7000 | 0.5657 | 249.6316 | 1538.4886 | 1.4376 | 17.3071 | 57.78 | 7.222 | 768.1817 |
| 8000 | 0.6465 | 225.5740 | 1397.4233 | 1.3836 | 17.3097 | 57.771 | 7.221 | 701.6876 |
| 9000 | 0.7273 | 207.4599 | 1342.3768 | 1.3331 | 17.3049 | 57.787 | 7.223 | 657.2436 |
| 10000 | 0.8081 | 189.0748 | 1151.9358 | 1.2846 | 17.3724 | 57.563 | 7.195 | 561.3511 |
| 11000 | 0.8889 | 173.5948 | 1120.0602 | 1.2337 | 17.3912 | 57.5 | 7.188 | 488.3670 |
| 12000 | 0.9697 | 157.5976 | 1006.0906 | 1.1896 | 17.3686 | 57.575 | 7.197 | 640.5209 |
| 12375 | 1.0 | 156.4636 | 960.7520 | 1.1773 | 17.446 | 57.32 | 7.165 | 627.6509 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0