base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_modelcard_try
    results: []

distily_modelcard_try

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
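
Assuming the student keeps the gpt2 architecture and tokenizer, it can be loaded with the standard transformers API. A minimal usage sketch; the hub path below is an assumption, so substitute the actual repository id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_modelcard_try"  # hypothetical hub path for this student model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation to sanity-check the distilled model
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```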

It achieves the following results on the evaluation set:

  • eval_enwikippl: 38656.0
  • eval_frwikippl: 218112.0
  • eval_zhwikippl: 54001664.0
  • eval_tinystoriesppl: 12160.0
  • eval_loss: 6.4375
  • eval_runtime: 0.0668
  • eval_samples_per_second: 29.948
  • eval_steps_per_second: 14.974
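
The *ppl metrics above are perplexities, i.e. the exponential of the average per-token negative log-likelihood on the corresponding evaluation corpus. A minimal sketch of computing such a perplexity with the loaded student model (the batching and token counting here are simplifications, not Distily's exact evaluation code):

```python
import math

import torch

def perplexity(model, tokenizer, texts, device="cpu"):
    """exp(mean negative log-likelihood per token) over a list of texts."""
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
            # labels=input_ids makes the model return the mean cross-entropy over (shifted) tokens
            loss = model(**enc, labels=enc["input_ids"]).loss
            n_tokens = enc["input_ids"].numel()
            total_nll += loss.item() * n_tokens
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```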

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=mse, layer_mapper=layer-2, projector=None)) (see the KL-loss sketch after this list)
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.2
  • num_epochs: 1.0

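Per the distillation_objective above, all of the training signal comes from a KL divergence between the teacher's and student's logits (the hidden-state and attention components have weight 0). A minimal sketch of that logits loss, assuming a plain forward KL with no temperature scaling (Distily's exact loss implementation may differ):

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean" averages per token position
    s = F.log_softmax(student_logits, dim=-1).flatten(end_dim=-2)
    t = F.softmax(teacher_logits, dim=-1).flatten(end_dim=-2)
    # KL(teacher || student): minimizing it pulls the student distribution toward the teacher's
    return F.kl_div(s, t, reduction="batchmean")
```
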
Resource Usage

Peak GPU Memory: 15.4263 GB
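
A peak figure like this can be read back from PyTorch's CUDA allocator statistics after training; a minimal sketch (whether Distily reports allocated or reserved memory is an assumption):

```python
import torch

# Peak GPU memory allocated on the default CUDA device, converted from bytes to GB
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```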

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 738734374912.0 | 47828755808256.0 | 20.375 | 0.128 | 15.619 | 7.81 | 2617245696.0 | 12232066859008.0 |
| 10 | 1.0 | 38656.0 | 218112.0 | 6.4375 | 0.0668 | 29.948 | 14.974 | 12160.0 | 54001664.0 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0