distily_experiments_loss_mse

This student model was distilled from the teacher model Qwen/Qwen2-0.5B-Instruct on an unspecified dataset.

The Distily library was used for this distillation.
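The distilled student can be loaded and used like any other Transformers causal language model. A minimal usage sketch (the repository id is taken from this card; the prompt is illustrative):

```python
# Minimal usage sketch: load the distilled student like any other
# Transformers causal LM. The repository id comes from this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_mse"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```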

The student achieves the following results on the evaluation set:

  • eval_enwikippl: 42178.9805
  • eval_frwikippl: 45721.7109
  • eval_zhwikippl: 198318.0
  • eval_loss: 0.0786
  • eval_runtime: 81.5551
  • eval_samples_per_second: 12.262
  • eval_steps_per_second: 3.065
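
The enwikippl, frwikippl, and zhwikippl metrics appear to be perplexities measured on English, French, and Chinese Wikipedia text respectively. The exact evaluation is defined by Distily; the sketch below only illustrates how perplexity is conventionally computed for a causal LM and is not the metric's actual implementation:

```python
# Illustrative perplexity computation for a causal LM. This is NOT the
# exact Distily evaluation; the dataset, context length, and aggregation
# are assumptions made for the sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_experiments_loss_mse"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels set, the model returns the mean cross-entropy loss
        # over predicted tokens; exp(loss) is the perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(perplexity("Paris is the capital and most populous city of France."))
```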

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_strategy: logits_activations
  • loss_fn: mse
  • train_embeddings: True
  • learning_rate: 0.0001
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
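
With distillation_strategy: logits_activations and loss_fn: mse, the student is trained to match both the teacher's output logits and its intermediate activations under a mean-squared-error objective. The snippet below is an illustrative PyTorch sketch of one such training step, not the Distily implementation; initializing the student from the teacher checkpoint and the unweighted one-to-one layer matching are assumptions made for the example:

```python
# Illustrative sketch of an MSE logits + activations distillation step.
# This is not the exact Distily implementation; student initialization,
# layer matching, and loss weighting are assumptions for the example.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id).eval()
student = AutoModelForCausalLM.from_pretrained(teacher_id)  # assumed: same architecture as teacher

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

batch = tokenizer(["An example training sentence for distillation."], return_tensors="pt")

with torch.no_grad():
    t_out = teacher(**batch, output_hidden_states=True)
s_out = student(**batch, output_hidden_states=True)

# MSE on output logits plus MSE on each hidden state; teacher and student
# have the same number of layers here, so states are matched one-to-one.
loss = F.mse_loss(s_out.logits, t_out.logits)
for s_h, t_h in zip(s_out.hidden_states, t_out.hidden_states):
    loss = loss + F.mse_loss(s_h, t_h)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```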

Resource Usage

Peak GPU Memory: 12.6346 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 13.0697 | 11.6518 | | | | | 21.6262 |
| 0 | 0 | 181673.8281 | 182246.2969 | 0.4342 | 81.4623 | 12.276 | 3.069 | 181831.9062 |
| 500 | 0.0808 | 51828.1172 | 61691.7969 | 0.0957 | 81.6929 | 12.241 | 3.06 | 207997.2812 |
| 1000 | 0.1616 | 45041.4180 | 54485.3594 | 0.0860 | 81.5919 | 12.256 | 3.064 | 209966.75 |
| 1500 | 0.2424 | 44320.2031 | 52669.4492 | 0.0833 | 81.6149 | 12.253 | 3.063 | 212294.125 |
| 2000 | 0.3232 | 41822.9727 | 48903.5156 | 0.0852 | 81.4039 | 12.284 | 3.071 | 223025.1094 |
| 2500 | 0.4040 | 41753.0312 | 49089.7227 | 0.0827 | 81.431 | 12.28 | 3.07 | 220010.75 |
| 3000 | 0.4848 | 42048.8906 | 49907.2539 | 0.0840 | 81.49 | 12.271 | 3.068 | 210042.8594 |
| 3500 | 0.5657 | 41577.9180 | 46600.7773 | 0.0849 | 81.4631 | 12.275 | 3.069 | 221650.6875 |
| 4000 | 0.6465 | 41242.4766 | 46538.6875 | 0.0824 | 81.561 | 12.261 | 3.065 | 211307.8125 |
| 4500 | 0.7273 | 41470.4414 | 46413.6094 | 0.0803 | 81.5887 | 12.257 | 3.064 | 221236.5469 |
| 5000 | 0.8081 | 42138.4922 | 45917.2109 | 0.0799 | 81.4339 | 12.28 | 3.07 | 207382.2812 |
| 5500 | 0.8889 | 41578.75 | 45413.4766 | 0.0818 | 81.4181 | 12.282 | 3.071 | 233708.25 |
| 6000 | 0.9697 | 42023.4766 | 45574.0469 | 0.0790 | 81.4797 | 12.273 | 3.068 | 208369.9375 |
| 6187 | 0.9999 | 42178.9805 | 45721.7109 | 0.0786 | 81.5551 | 12.262 | 3.065 | 198318.0 |

Framework versions

  • Distily 0.1.0
  • Transformers 4.43.3
  • Pytorch 2.3.0
  • Datasets 2.20.0