lapp0 committed
Commit f88f22a
1 Parent(s): 7077291

End of training
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ base_model: roneneldan/TinyStories-33M
+ library_name: Distily
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: distily_bench_obj_cross_v2.2
+   results: []
+ ---
+
+ # distily_bench_obj_cross_v2.2
+
+ This student model is distilled from the teacher model [roneneldan/TinyStories-33M](https://huggingface.co/roneneldan/TinyStories-33M) using the dataset (unspecified).
+
+ The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
+
+ It achieves the following results on the evaluation set:
+ - eval_enwikippl: 201.4308
+ - eval_frwikippl: 134811.7969
+ - eval_zhwikippl: 2802169.0
+ - eval_tinystoriesppl: 8.5017
+ - eval_loss: 1.1632
+ - eval_runtime: 13.1959
+ - eval_samples_per_second: 75.781
+ - eval_steps_per_second: 9.473
+
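The `*ppl` metrics above are corpus perplexities. Perplexity is conventionally the exponential of the mean per-token negative log-likelihood; a minimal pure-Python illustration (the helper name is ours, not part of Distily):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning uniform probability over an 8-token vocabulary has
# per-token NLL ln(8), so its perplexity is exactly 8.
print(perplexity([math.log(8)] * 5))
```

Lower perplexity on a corpus means the model assigns higher average probability to that corpus's tokens, which is why the student's low `tinystoriesppl` alongside very high `zhwikippl` reflects its in-domain training distribution.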
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment.
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+ -->
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
+ - train_embeddings: True
+ - learning_rate: 0.04
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.5
+ - num_epochs: 1.0
+
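The distillation objective above uses only a logits-level KL term (weight 1); the hidden-state (`hs`) and attention (`attn`) components are disabled (weight 0). At each token position, a logits KL loss compares the teacher's next-token distribution with the student's. A pure-Python sketch of the idea, not Distily's actual implementation:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over one position's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; any mismatch gives a positive loss.
print(kl_logits_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

In practice this is averaged over all positions in a batch, and `train_embeddings: True` indicates the student's embedding weights are updated as part of the same optimization.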
+ ### Resource Usage
+ Peak GPU Memory: 8.0557 GB
+
+ ### Eval-Phase Metrics
+ | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
+ | 0 | 0 | 106439.8672 | 83269.1172 | 6.7670 | 13.084 | 76.429 | 9.554 | 124919.8828 | 108523.1641 |
+ | 500 | 0.0404 | 540.4661 | 2634702.25 | 4.5202 | 13.2137 | 75.679 | 9.46 | 8.4174 | 83522400.0 |
+ | 1000 | 0.0808 | 1497.5195 | 15870873.0 | 2.2889 | 13.1749 | 75.902 | 9.488 | 14.0065 | 464464704.0 |
+ | 1500 | 0.1212 | 1866.7501 | 54564276.0 | 1.7749 | 13.1457 | 76.071 | 9.509 | 12.2171 | 3062959872.0 |
+ | 2000 | 0.1616 | 354.3223 | 675337.5625 | 1.3000 | 13.1662 | 75.952 | 9.494 | 9.5618 | 7180616.0 |
+ | 2500 | 0.2020 | 237.2996 | 200161.4531 | 1.2027 | 13.1313 | 76.154 | 9.519 | 9.2355 | 2578359.25 |
+ | 3000 | 0.2424 | 209.1669 | 144669.0625 | 1.1789 | 13.1334 | 76.142 | 9.518 | 8.7617 | 2323565.0 |
+ | 3500 | 0.2828 | 199.7487 | 140412.8906 | 1.1786 | 13.1764 | 75.893 | 9.487 | 8.3659 | 2391488.5 |
+ | 4000 | 0.3232 | 194.1800 | 130293.7578 | 1.1813 | 13.1424 | 76.089 | 9.511 | 8.1468 | 2006979.125 |
+ | 4500 | 0.3636 | 192.8458 | 132104.8594 | 1.1847 | 13.2475 | 75.486 | 9.436 | 8.0278 | 1976689.25 |
+ | 5000 | 0.4040 | 204.8733 | 171334.5781 | 1.1910 | 13.2362 | 75.55 | 9.444 | 7.8049 | 7161488.0 |
+ | 5500 | 0.4444 | 195.0279 | 158004.8125 | 1.1950 | 13.2423 | 75.516 | 9.439 | 7.6020 | 5050465.0 |
+ | 6000 | 0.4848 | 190.4297 | 152376.5 | 1.1980 | 13.2444 | 75.504 | 9.438 | 7.4381 | 3569331.0 |
+ | 6500 | 0.5253 | 188.9310 | 144618.1562 | 1.1982 | 13.2202 | 75.642 | 9.455 | 7.4149 | 3188412.75 |
+ | 7000 | 0.5657 | 194.4358 | 148488.4219 | 1.1898 | 13.2162 | 75.664 | 9.458 | 7.6775 | 3383867.0 |
+ | 7500 | 0.6061 | 197.3188 | 155367.4844 | 1.1877 | 13.216 | 75.666 | 9.458 | 7.7428 | 3572188.0 |
+ | 8000 | 0.6465 | 201.1151 | 143734.7344 | 1.1764 | 13.2338 | 75.564 | 9.446 | 8.2883 | 3349733.0 |
+ | 8500 | 0.6869 | 200.3220 | 141594.625 | 1.1751 | 13.1923 | 75.802 | 9.475 | 8.2352 | 3220043.0 |
+ | 9000 | 0.7273 | 200.8854 | 134460.8906 | 1.1695 | 13.2139 | 75.678 | 9.46 | 8.4530 | 2981093.25 |
+ | 9500 | 0.7677 | 201.4308 | 134811.7969 | 1.1632 | 13.1959 | 75.781 | 9.473 | 8.5017 | 2802169.0 |
+ | 10000 | 0.8081 | 199.1423 | 119070.3125 | 1.1540 | 13.2527 | 75.456 | 9.432 | 8.8006 | 2319847.5 |
+ | 10500 | 0.8485 | 196.0125 | 108148.8438 | 1.1479 | 13.2351 | 75.556 | 9.445 | 8.9736 | 2071720.5 |
+ | 11000 | 0.8889 | 184.3195 | 84100.1484 | 1.1397 | 13.2039 | 75.735 | 9.467 | 9.2554 | 1337898.0 |
+ | 11500 | 0.9293 | 187.4005 | 83163.6484 | 1.1355 | 13.2213 | 75.635 | 9.454 | 9.7272 | 1300591.875 |
+ | 12000 | 0.9697 | 192.9990 | 87212.625 | 1.1357 | 13.295 | 75.216 | 9.402 | 10.3592 | 1611324.75 |
+ | 12375 | 1.0 | 193.3956 | 88350.2109 | 1.1344 | 13.1951 | 75.786 | 9.473 | 10.2582 | 1620378.125 |
+
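With `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.5` over the 12375 steps in the table's final row, the learning rate climbs linearly to the 0.04 peak for the first half of the run, then decays linearly to zero. A sketch of that schedule (the function name is ours, not the `transformers` implementation):

```python
def linear_with_warmup(step, total_steps=12375, warmup_ratio=0.5, peak_lr=0.04):
    """Linear warmup to peak_lr over warmup_ratio of training, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

# The LR peaks at mid-run (step 6187) and returns to zero at the final step.
print(linear_with_warmup(6187), linear_with_warmup(12375))
```

The schedule's shape lines up with the table: eval loss plateaus around steps 6000–7000, where the learning rate is at its 0.04 maximum, and resumes improving as the rate decays.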
+ ### Framework versions
+ - Distily 0.2.0
+ - Transformers 4.44.0
+ - Pytorch 2.3.0
+ - Datasets 2.20.0
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "transformers_version": "4.44.0"
+ }
logs/learning_rate=0.04, lr_scheduler_type=linear, warmup_ratio=0.5/events.out.tfevents.1723844851.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc53912a7a0161af9b69cd5fdce139b103f6ca2eedc4489c3adee7ab1ff7e83c
+ size 307