---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_modelcard_try
  results: []
---

# distily_modelcard_try

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using the dataset (unspecified).

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 38656.0
- eval_frwikippl: 218112.0
- eval_zhwikippl: 54001664.0
- eval_tinystoriesppl: 12160.0
- eval_loss: 6.4375
- eval_runtime: 0.0668
- eval_samples_per_second: 29.948
- eval_steps_per_second: 14.974

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=mse, layer_mapper=layer-2, projector=None))
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0

### Resource Usage

Peak GPU Memory: 15.4263 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 738734374912.0 | 47828755808256.0 | 20.375 | 0.128 | 15.619 | 7.81 | 2617245696.0 | 12232066859008.0 |
| 10 | 1.0 | 38656.0 | 218112.0 | 6.4375 | 0.0668 | 29.948 | 14.974 | 12160.0 | 54001664.0 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
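
The `distillation_objective` above places all weight on the logits component with `loss_fn=kl` (the hidden-state and attention components have weight 0). As a rough illustration only, not Distily's actual implementation, a KL-divergence logits loss can be sketched in PyTorch as follows; the function name, and the absence of temperature scaling and padding masks, are assumptions for this example:

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the teacher's and student's next-token distributions.

    Hypothetical helper illustrating the `loss_fn=kl` logits component listed in the
    hyperparameters; Distily's real objective may add temperature scaling or masking.
    """
    # F.kl_div expects log-probabilities for the input and probabilities for the target.
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```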
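
A minimal usage sketch with the Transformers library, assuming the student model is published under a Hub repo matching the card name (`distily_modelcard_try` is a placeholder path) and that it reuses the GPT-2 tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual Hub path where this student model is hosted.
model_id = "distily_modelcard_try"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed: student shares the GPT-2 tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```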