metadata

base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross
    results: []

distily_bench_obj_cross

This student model was distilled from the teacher model roneneldan/TinyStories-33M using the Distily library. The training dataset is not specified.

It achieves the following results on the evaluation set:

eval_enwikippl: 24580.0566
eval_frwikippl: 58429.5703
eval_zhwikippl: 90638.1875
eval_tinystoriesppl: 13633.8428
eval_loss: 18.8988
eval_runtime: 32.6253
eval_samples_per_second: 76.628
eval_steps_per_second: 9.594
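
The *ppl values above are perplexities. As a minimal sketch of how such a number is conventionally defined (the exponential of the mean negative log-likelihood per token), and assuming Distily follows this standard definition, the calculation looks like the following. The function name and the example probabilities are hypothetical and not taken from the Distily code base. Note that eval_loss cannot be converted this way: exp(18.8988) is on the order of 10^8 and matches none of the reported perplexities, so eval_loss presumably reports the distillation objective rather than a cross-entropy.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four reference tokens.
example_log_probs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]
ppl = perplexity(example_log_probs)

# Going the other way: the mean per-token loss (in nats) implied by the
# reported student and teacher TinyStories perplexities.
student_nll = math.log(13633.8428)
teacher_nll = math.log(3.9789)
```

Read this way, a lower perplexity means the model assigns higher probability to the reference text, which is why the teacher row in the table further down is the natural point of comparison.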

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
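
The distillation_objective above combines three terms: a KL divergence on the logits (weight 1) and raw MSE terms on the hidden states and attentions (weight 10.0 each). The sketch below shows one way such a weighted sum can be computed on toy lists in plain Python. It is an illustration under assumptions, not Distily's implementation: the KL direction (teacher relative to student), the reading of raw_mse as an unnormalized mean squared error, and all function and key names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one distribution (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Plain mean squared error, assumed here to be what raw_mse denotes."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(teacher, student, w_logits=1.0, w_hs=10.0, w_attn=10.0):
    """Weighted sum of the three components, using the weights from the model card."""
    return (
        w_logits * kl_divergence(teacher["logits"], student["logits"])
        + w_hs * mse(teacher["hs"], student["hs"])
        + w_attn * mse(teacher["attn"], student["attn"])
    )


# Toy activations; real tensors would have batch, sequence and layer dimensions.
teacher = {"logits": [2.0, 0.0, -1.0], "hs": [0.5, -0.5], "attn": [0.7, 0.3]}
student_same = dict(teacher)
student_off = {"logits": [2.0, 0.0, -1.0], "hs": [1.5, -0.5], "attn": [0.7, 0.3]}

loss_same = combined_loss(teacher, student_same)  # identical outputs
loss_off = combined_loss(teacher, student_off)    # only hidden states differ
```

With weights of 10.0 on the hidden-state and attention terms, even modest activation mismatches can dominate the total, which may be relevant when interpreting the reported eval_loss.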

Resource Usage

Peak GPU Memory: 16.2498 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	30500.8262	64429.8789	19.1222	32.5358	76.838	9.62	17883.2402	92396.1641
2000	0.1293	24580.0566	58429.5703	18.8980	32.4735	76.986	9.639	13633.8428	90638.1875
4000	0.2586	24580.0566	58429.5703	18.8980	32.5203	76.875	9.625	13633.8428	90638.1875
6000	0.3879	24580.0566	58429.5703	18.8988	32.628	76.621	9.593	13633.8428	90638.1875
8000	0.5172	24580.0566	58429.5703	18.8988	32.6253	76.628	9.594	13633.8428	90638.1875
10000	0.6465	24580.0566	58429.5703	18.8988	32.4883	76.951	9.634	13633.8428	90638.1875
12000	0.7757	24580.0566	58429.5703	18.8980	32.4949	76.935	9.632	13633.8428	90638.1875
14000	0.9050	24580.0566	58429.5703	18.8988	32.507	76.906	9.629	13633.8428	90638.1875
15469	1.0	24580.0566	58429.5703	18.8988	32.6353	76.604	9.591	13633.8428	90638.1875
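
The perplexity columns in the table above are identical from step 2000 through the final step, and the student remains far from the teacher row. A short sketch for checking this programmatically follows; the rows are copied from the table (step, enwikippl, tinystoriesppl), while the helper name plateau_start is hypothetical.

```python
# (step, enwikippl, tinystoriesppl) copied from the eval-phase table.
rows = [
    (0, 30500.8262, 17883.2402),
    (2000, 24580.0566, 13633.8428),
    (4000, 24580.0566, 13633.8428),
    (6000, 24580.0566, 13633.8428),
    (8000, 24580.0566, 13633.8428),
    (10000, 24580.0566, 13633.8428),
    (12000, 24580.0566, 13633.8428),
    (14000, 24580.0566, 13633.8428),
    (15469, 24580.0566, 13633.8428),
]
teacher_tinystoriesppl = 3.9789


def plateau_start(rows):
    """Return the earliest step from which every later row repeats the final metrics."""
    final = rows[-1][1:]
    start = rows[-1][0]
    for step, *metrics in reversed(rows):
        if tuple(metrics) != final:
            break
        start = step
    return start


first_flat_step = plateau_start(rows)

# How many times larger the final student perplexity is than the teacher's.
gap_ratio = rows[-1][2] / teacher_tinystoriesppl

# Epoch fraction of a checkpoint, given 15469 total steps for 1.0 epoch.
epoch_at_2000 = round(2000 / 15469, 4)
```

On these rows the student's TinyStories perplexity ends more than three thousand times higher than the teacher's, and no perplexity column changes after the first logged checkpoint.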

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0