---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2
  results: []
---

# distily_bench_obj_cross_v2

This student model was distilled from the teacher model [roneneldan/TinyStories-33M](https://huggingface.co/roneneldan/TinyStories-33M). The training dataset is not specified.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 1882.2876
- eval_frwikippl: 38923.2266
- eval_zhwikippl: 63461.6641
- eval_tinystoriesppl: 451.2739
- eval_loss: 4.8257
- eval_runtime: 13.1445
- eval_samples_per_second: 76.078
- eval_steps_per_second: 9.51
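
To help read the perplexity figures above, here is a minimal sketch that converts them to a mean per-token negative log-likelihood. It assumes the standard definition, perplexity = exp(mean NLL in nats); how Distily computes these metrics is not stated in this card, and the function names are illustrative.

```python
import math

def perplexity(token_nlls):
    """Standard definition: exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def mean_nll_from_perplexity(ppl):
    """Inverse of the definition above."""
    return math.log(ppl)

# Final-checkpoint perplexities copied from the list above.
reported_ppl = {
    "enwiki": 1882.2876,
    "frwiki": 38923.2266,
    "zhwiki": 63461.6641,
    "tinystories": 451.2739,
}
EVAL_LOSS = 4.8257

# Mean NLL per token implied by each perplexity (assumption: standard definition).
implied_nll = {name: mean_nll_from_perplexity(p) for name, p in reported_ppl.items()}

for name, nll in implied_nll.items():
    print(f"{name}: ppl={reported_ppl[name]:.4f} -> mean NLL={nll:.4f} nats/token")

# exp(eval_loss) matches none of the perplexities, so eval_loss is presumably
# the distillation objective rather than a cross-entropy (an assumption).
print(f"exp(eval_loss) = {math.exp(EVAL_LOSS):.1f}")
```

Under this reading, `eval_loss` and the perplexities measure different things and should not be compared directly.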

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
-->

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_restarts
- num_epochs: 1.0
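
The `distillation_objective` above combines a KL term on logits (weight 1) with an MSE term on hidden states (weight 2.0), and gives attention a weight of 0. The sketch below shows that weighted sum on plain Python lists. It is not Distily's implementation: the function names, the KL direction (teacher relative to student), and the assumption that teacher and student hidden states have the same size (no `layer_mapper` or `projector`) are all illustrative choices.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student); the direction is an assumption."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mse(teacher_hidden, student_hidden):
    """Mean squared error between equally sized hidden-state vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)

def distillation_loss(t_logits, s_logits, t_hidden, s_hidden,
                      logits_weight=1.0, hs_weight=2.0):
    """Weighted sum of the two active components.

    The attention component has weight 0 in this card, so it is omitted.
    """
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + hs_weight * mse(t_hidden, s_hidden))

# Toy example: identical logits, hidden states differing by 1.0 per element.
loss = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                         [1.0, 1.0], [0.0, 0.0])
print(loss)
```

In the toy example the KL term is zero and the MSE is 1.0, so the whole loss comes from the hidden-state term scaled by its weight of 2.0.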

### Resource Usage
- Peak GPU Memory: 8.1729 GB

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** |  | 169.9865 | 47377.9414 |  |  |  |  | 3.9789 | 4998.1294 |
| 0 | 0 | 10909.4980 | 77116.0 | 6.3550 | 13.1937 | 75.794 | 9.474 | 4267.7983 | 73081.2031 |
| 1000 | 0.0808 | 1884.7683 | 38923.2266 | 4.8260 | 13.1354 | 76.13 | 9.516 | 453.2929 | 63529.4258 |
| 2000 | 0.1616 | 1882.5793 | 38923.2266 | 4.8257 | 13.2412 | 75.522 | 9.44 | 451.5352 | 63461.6641 |
| 3000 | 0.2424 | 1882.5793 | 38923.2266 | 4.8257 | 13.2384 | 75.538 | 9.442 | 451.6844 | 63461.6641 |
| 4000 | 0.3232 | 1881.7043 | 38923.2266 | 4.8257 | 13.2242 | 75.619 | 9.452 | 450.9009 | 63461.6641 |
| 5000 | 0.4040 | 1883.1630 | 38923.2266 | 4.8257 | 13.1558 | 76.012 | 9.501 | 451.8337 | 63461.6641 |
| 6000 | 0.4848 | 1883.1630 | 38923.2266 | 4.8257 | 13.2198 | 75.644 | 9.456 | 451.8337 | 63461.6641 |
| 7000 | 0.5657 | 1884.4762 | 38923.2266 | 4.8257 | 13.2183 | 75.653 | 9.457 | 452.8433 | 63529.4258 |
| 8000 | 0.6465 | 1882.5793 | 38923.2266 | 4.8257 | 13.1236 | 76.198 | 9.525 | 451.4604 | 63461.6641 |
| 9000 | 0.7273 | 1882.2876 | 38923.2266 | 4.8257 | 13.1445 | 76.078 | 9.51 | 451.2739 | 63461.6641 |
| 10000 | 0.8081 | 1880.2477 | 38923.2266 | 4.8257 | 13.2204 | 75.641 | 9.455 | 450.4167 | 63461.6641 |
| 11000 | 0.8889 | 1882.5793 | 38923.2266 | 4.8257 | 13.267 | 75.375 | 9.422 | 451.7592 | 63461.6641 |
| 12000 | 0.9697 | 1883.1630 | 38923.2266 | 4.8257 | 13.182 | 75.861 | 9.483 | 451.8337 | 63461.6641 |
| 12375 | 1.0 | 1883.1630 | 38923.2266 | 4.8257 | 13.202 | 75.746 | 9.468 | 451.8337 | 63461.6641 |
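
The `epoch`, `runtime`, and throughput columns can be cross-checked with a little arithmetic. The sketch below assumes that `epoch` is `step / total_steps`, and that training used a single device with no gradient accumulation; neither assumption is confirmed by this card.

```python
TOTAL_STEPS = 12375      # last row of the table
TRAIN_BATCH_SIZE = 8     # from the hyperparameters
EVAL_BATCH_SIZE = 8

def epoch_at(step):
    """Fraction of the single training epoch completed at a given step."""
    return step / TOTAL_STEPS

# Training samples seen, assuming one device and no gradient accumulation.
train_samples = TOTAL_STEPS * TRAIN_BATCH_SIZE

# Evaluation set size recovered from the step-9000 row:
# runtime (s) * samples_per_second and runtime (s) * steps_per_second.
eval_samples = round(13.1445 * 76.078)
eval_steps = round(13.1445 * 9.51)

print(round(epoch_at(1000), 4))
print(train_samples, eval_samples, eval_steps)
```

The recovered counts agree with each other: the implied number of evaluation samples divided by the evaluation batch size equals the implied number of evaluation steps.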

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0