
# distily_experiments_loss_kl

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct on an unspecified dataset.

The Distily library was used for this distillation.
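
The checkpoint is a standard `transformers` causal language model, so it can be loaded and sampled the usual way. The snippet below is a minimal usage sketch; the repo id `lapp0/distily_experiments_loss_kl` is taken from this card, while the prompt and generation settings are arbitrary examples.

```python
# Minimal usage sketch: load the distilled student and generate text.
# The bfloat16 dtype matches the stored tensor type; adjust for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_kl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```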

It achieves the following results on the evaluation set (the perplexity metrics are described briefly after the list):

- eval_enwikippl: 44644.5977
- eval_frwikippl: 57738.3242
- eval_zhwikippl: 350857.0312
- eval_loss: 34.5918
- eval_runtime: 91.6837
- eval_samples_per_second: 10.907
- eval_steps_per_second: 2.727
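
The `*wikippl` values are perplexities of the student on English, French, and Chinese Wikipedia text. The exact corpora and windowing used by Distily are not specified here; the sketch below only illustrates the standard relationship between mean token-level cross-entropy and perplexity, which these metrics are assumed to follow.

```python
# Illustrative only: perplexity as exp(mean next-token cross-entropy).
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    nll = F.cross_entropy(shift_logits, shift_labels)  # mean negative log-likelihood
    return torch.exp(nll).item()
```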

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_strategy: logits_activations
- loss_fn: kl (see the loss sketch after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
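
With `loss_fn: kl`, the student is trained to match the teacher's output distribution via a KL-divergence term on the logits (the `logits_activations` strategy also distills intermediate activations, which is not shown here). The block below is a minimal sketch of a logits-only KL loss; the temperature, reduction, and scaling are assumptions, not Distily's exact implementation.

```python
# Sketch of a KL distillation loss on logits. temperature=1.0 and "batchmean"
# reduction are assumptions and may differ from Distily's implementation.
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # KL(teacher || student) over the vocabulary dimension.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return loss * temperature ** 2  # standard scaling when a temperature is used
```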

## Resource Usage

Peak GPU Memory: 18.5713 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
| 0 | 0 | 182369.3906 | 181455.125 | 780.3353 | 91.2363 | 10.961 | 2.74 | 180761.6719 |
| 500 | 0.0808 | 86362.2812 | 101159.6328 | 54.5867 | 91.4066 | 10.94 | 2.735 | 232064.5 |
| 1000 | 0.1616 | 69487.6797 | 86111.1172 | 53.6604 | 91.4579 | 10.934 | 2.733 | 345169.5312 |
| 1500 | 0.2424 | 82681.5703 | 89315.8984 | 35.7168 | 91.5034 | 10.929 | 2.732 | 445378.8125 |
| 2000 | 0.3232 | 57160.3164 | 68079.1016 | 35.2824 | 91.7023 | 10.905 | 2.726 | 518689.375 |
| 2500 | 0.4040 | 51816.1094 | 64056.9492 | 35.1557 | 91.637 | 10.913 | 2.728 | 541196.6875 |
| 3000 | 0.4848 | 48757.2461 | 61247.7969 | 34.9901 | 91.7102 | 10.904 | 2.726 | 490950.375 |
| 3500 | 0.5657 | 46986.8359 | 60241.9648 | 34.8887 | 91.7509 | 10.899 | 2.725 | 494001.2812 |
| 4000 | 0.6465 | 45312.1992 | 58663.0117 | 34.8189 | 91.5035 | 10.929 | 2.732 | 481694.25 |
| 4500 | 0.7273 | 44666.1914 | 58929.5781 | 34.8044 | 91.6675 | 10.909 | 2.727 | 422349.9062 |
| 5000 | 0.8081 | 44634.5938 | 59004.7656 | 34.6895 | 91.5871 | 10.919 | 2.73 | 366104.7188 |
| 5500 | 0.8889 | 43732.4805 | 57270.8125 | 34.6591 | 91.6394 | 10.912 | 2.728 | 369470.4688 |
| 6000 | 0.9697 | 43799.5938 | 57096.7930 | 34.6125 | 91.6634 | 10.909 | 2.727 | 372021.5 |
| 6187 | 0.9999 | 44644.5977 | 57738.3242 | 34.5918 | 91.6837 | 10.907 | 2.727 | 350857.0312 |

## Framework versions

- Distily 0.1.0
- Transformers 4.43.3
- Pytorch 2.3.0
- Datasets 2.20.0