Training in progress, step 61875

Files changed (5)

README.md +32 -32
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723776807.93d6cbb3ad53 +3 -0
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723776606.93d6cbb3ad53 +2 -2
model.safetensors +1 -1
training_args.bin +1 -1

README.md CHANGED

@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 198.0270
-- eval_frwikippl: 17127.3379
-- eval_zhwikippl: 63614.1797
-- eval_tinystoriesppl: 21.7514
-- eval_loss: 12.7123
-- eval_runtime: 65.1143
-- eval_samples_per_second: 76.788
-- eval_steps_per_second: 9.599
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.001
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
@@ -62,31 +62,31 @@ Peak GPU Memory: 8.2666 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** |  | 169.9865 | 47377.9414 |  |  |  |  | 3.9789 | 4998.1294 |
-| 0 | 0 | 14504.7236 | 73076.2578 | 17.7824 | 65.8473 | 75.933 | 9.492 | 6091.4858 | 69506.0391 |
-| 3000 | 0.0485 | 198.0270 | 17127.3379 | 12.7127 | 65.1257 | 76.775 | 9.597 | 21.7514 | 63648.1641 |
-| 6000 | 0.0970 | 197.0935 | 17088.7852 | 12.7124 | 65.0473 | 76.867 | 9.608 | 21.6429 | 63750.0977 |
-| 9000 | 0.1455 | 198.0270 | 17127.3379 | 12.7123 | 65.1143 | 76.788 | 9.599 | 21.7514 | 63614.1797 |
-| 12000 | 0.1939 | 197.4985 | 17098.4199 | 12.7128 | 65.0689 | 76.842 | 9.605 | 21.7173 | 63886.3086 |
-| 15000 | 0.2424 | 197.3838 | 17122.5215 | 12.7128 | 65.1155 | 76.787 | 9.598 | 21.6850 | 63954.5195 |
-| 18000 | 0.2909 | 197.9733 | 17108.0586 | 12.7130 | 65.0798 | 76.829 | 9.604 | 21.7865 | 63818.1680 |
-| 21000 | 0.3394 | 197.1698 | 17103.2305 | 12.7134 | 65.0799 | 76.829 | 9.604 | 21.6322 | 64022.8672 |
-| 24000 | 0.3879 | 197.8507 | 17117.7051 | 12.7131 | 65.1719 | 76.72 | 9.59 | 21.7757 | 63716.1211 |
-| 27000 | 0.4364 | 197.6975 | 17127.3379 | 12.7131 | 65.1124 | 76.79 | 9.599 | 21.7191 | 63954.5195 |
-| 30000 | 0.4848 | 197.3150 | 17079.1562 | 12.7131 | 65.1085 | 76.795 | 9.599 | 21.6904 | 63920.4375 |
-| 33000 | 0.5333 | 197.6209 | 17103.2305 | 12.7129 | 65.2897 | 76.582 | 9.573 | 21.7191 | 63750.0977 |
-| 36000 | 0.5818 | 198.3110 | 17127.3379 | 12.7122 | 65.5537 | 76.273 | 9.534 | 21.7883 | 63614.1797 |
-| 39000 | 0.6303 | 198.2802 | 17127.3379 | 12.7128 | 65.2001 | 76.687 | 9.586 | 21.7721 | 63648.1641 |
-| 42000 | 0.6788 | 197.9580 | 17127.3379 | 12.7130 | 65.4586 | 76.384 | 9.548 | 21.7433 | 63512.4648 |
-| 45000 | 0.7273 | 198.2802 | 17108.0586 | 12.7129 | 65.2819 | 76.591 | 9.574 | 21.8009 | 63614.1797 |
-| 48000 | 0.7758 | 197.5979 | 17098.4199 | 12.7125 | 65.1997 | 76.688 | 9.586 | 21.6940 | 63648.1641 |
-| 51000 | 0.8242 | 198.2802 | 17127.3379 | 12.7120 | 65.5503 | 76.277 | 9.535 | 21.7892 | 63512.4648 |
-| 54000 | 0.8727 | 198.2189 | 17127.3379 | 12.7129 | 65.3863 | 76.469 | 9.559 | 21.7811 | 63716.1211 |
-| 57000 | 0.9212 | 199.1886 | 17136.9941 | 12.7133 | 65.1649 | 76.728 | 9.591 | 21.8759 | 63343.2148 |
-| 60000 | 0.9697 | 197.3226 | 17122.5215 | 12.7127 | 65.2079 | 76.678 | 9.585 | 21.6886 | 63648.1641 |
-| 61875 | 1.0 | 198.9419 | 17127.3379 | 12.7117 | 65.2766 | 76.597 | 9.575 | 21.8469 | 63648.1641 |
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
-- Datasets 2.21.0

 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 It achieves the following results on the evaluation set:
+- eval_enwikippl: 149.5442
+- eval_frwikippl: 28142.1230
+- eval_zhwikippl: 243104.3594
+- eval_tinystoriesppl: 11.2706
+- eval_loss: 7.4452
+- eval_runtime: 66.0052
+- eval_samples_per_second: 75.752
+- eval_steps_per_second: 9.469
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
+- learning_rate: 0.004
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** |  | 169.9865 | 47377.9414 |  |  |  |  | 3.9789 | 4998.1294 |
+| 0 | 0 | 58069.8203 | 77442.5625 | 18.5372 | 65.9335 | 75.834 | 9.479 | 46072.8867 | 100550.5078 |
+| 3000 | 0.0485 | 145.3919 | 28068.8965 | 7.4450 | 66.1007 | 75.642 | 9.455 | 10.8792 | 239563.0469 |
+| 6000 | 0.0970 | 143.8738 | 27934.7852 | 7.4443 | 65.9587 | 75.805 | 9.476 | 10.7017 | 239818.8281 |
+| 9000 | 0.1455 | 149.5442 | 28142.1230 | 7.4452 | 66.0052 | 75.752 | 9.469 | 11.2706 | 243104.3594 |
+| 12000 | 0.1939 | 141.4481 | 28096.5879 | 7.4447 | 66.228 | 75.497 | 9.437 | 10.4616 | 242650.5938 |
+| 15000 | 0.2424 | 141.9365 | 27532.4258 | 7.4447 | 66.1198 | 75.62 | 9.453 | 10.5402 | 235884.5 |
+| 18000 | 0.2909 | 150.4271 | 28453.0332 | 7.4452 | 65.9158 | 75.854 | 9.482 | 11.3604 | 248415.0781 |
+| 21000 | 0.3394 | 148.4649 | 27337.2715 | 7.4450 | 65.7078 | 76.094 | 9.512 | 11.2888 | 229674.3125 |
+| 24000 | 0.3879 | 149.7760 | 28039.2520 | 7.4446 | 65.7891 | 76.0 | 9.5 | 11.2827 | 240716.3594 |
+| 27000 | 0.4364 | 141.8706 | 28049.1211 | 7.4454 | 65.831 | 75.952 | 9.494 | 10.5280 | 235255.9062 |
+| 30000 | 0.4848 | 144.7906 | 28084.7207 | 7.4449 | 65.9422 | 75.824 | 9.478 | 10.7119 | 240716.3594 |
+| 33000 | 0.5333 | 149.6832 | 28237.4258 | 7.4454 | 65.807 | 75.98 | 9.497 | 11.2524 | 244013.9531 |
+| 36000 | 0.5818 | 148.6030 | 27445.3125 | 7.4453 | 65.6651 | 76.144 | 9.518 | 11.2729 | 236893.5625 |
+| 39000 | 0.6303 | 142.9683 | 27676.2949 | 7.4447 | 65.6729 | 76.135 | 9.517 | 10.6589 | 235381.5781 |
+| 42000 | 0.6788 | 146.5510 | 27895.4648 | 7.4449 | 65.6881 | 76.117 | 9.515 | 10.9904 | 239690.7812 |
+| 45000 | 0.7273 | 149.2144 | 28023.4531 | 7.4454 | 65.9058 | 75.866 | 9.483 | 11.2701 | 240716.3594 |
+| 48000 | 0.7758 | 144.2086 | 28243.4043 | 7.4449 | 65.873 | 75.904 | 9.488 | 10.7022 | 244339.7188 |
+| 51000 | 0.8242 | 141.9915 | 27781.7559 | 7.4450 | 65.9256 | 75.843 | 9.48 | 10.5589 | 239563.0469 |
+| 54000 | 0.8727 | 145.6399 | 28219.5234 | 7.4451 | 65.6892 | 76.116 | 9.515 | 10.8693 | 238542.6094 |
+| 57000 | 0.9212 | 144.2365 | 27040.4609 | 7.4445 | 65.6838 | 76.122 | 9.515 | 10.8312 | 227175.6875 |
+| 60000 | 0.9697 | 144.3482 | 26979.5938 | 7.4447 | 65.623 | 76.193 | 9.524 | 10.9257 | 232138.6562 |
+| 61875 | 1.0 | 146.9147 | 28084.7207 | 7.4450 | 65.6082 | 76.21 | 9.526 | 10.9569 | 237146.5 |
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
+- Datasets 2.20.0
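
The epoch column in both tables is consistent with the step divided by the final step count (61875), and the effect of raising the learning rate from 0.001 to 0.004 can be read from the last row of each table. A minimal sketch of both calculations follows, using only values copied from the step-61875 rows; the helper names `epoch_at` and `relative_change` are hypothetical and are not part of Distily.

```python
# Values copied from the step-61875 rows of the removed (-) and added (+) tables.
TOTAL_STEPS = 61875

old_run = {  # learning_rate: 0.001
    "enwikippl": 198.9419,
    "frwikippl": 17127.3379,
    "tinystoriesppl": 21.8469,
    "zhwikippl": 63648.1641,
    "loss": 12.7117,
}
new_run = {  # learning_rate: 0.004
    "enwikippl": 146.9147,
    "frwikippl": 28084.7207,
    "tinystoriesppl": 10.9569,
    "zhwikippl": 237146.5,
    "loss": 7.4450,
}


def epoch_at(step, total_steps=TOTAL_STEPS):
    """Fraction of the single training epoch completed at `step`, to 4 decimals."""
    return round(step / total_steps, 4)


def relative_change(old, new):
    """Per-metric relative change (new - old) / old; negative means the value dropped."""
    return {name: (new[name] - old[name]) / old[name] for name in old}


if __name__ == "__main__":
    print(epoch_at(3000))  # matches the 0.0485 shown in the table
    for name, change in relative_change(old_run, new_run).items():
        print(f"{name}: {change:+.1%}")
```

Lower perplexity is better, so in this comparison the English Wikipedia and TinyStories perplexities improve while the French and Chinese Wikipedia perplexities get worse.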

logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723776807.93d6cbb3ad53 ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bbd708cc84d84f0879498daaa89f221a21472c839adf7e210a0701a804c45340
+size 16923356

logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723776606.93d6cbb3ad53 CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b89b1687def63d5c64d9c3882e7a24672822a08f9b6d6ff28bccd53e82cdb4ff
-size 312

 version https://git-lfs.github.com/spec/v1
+oid sha256:4aa31525857f66345f637a1366b25d246ccfe894e9c9feefa8da900b171ecd03
+size 588

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e536919613b3b4f9e76519f3ca3e4effec3088c0d4b2ca7613f361ffc185cf9b
 size 137033984

 version https://git-lfs.github.com/spec/v1
+oid sha256:e02c3ac22ff0d23f01f139c12fab9df94791fcaf3e24d1dc8340b3319a0d1408
 size 137033984

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5e9fc10eac5690311752511e3268f064327bfb0dcabcd1d7e7eb6dbe9be1169a
 size 1017948104

 version https://git-lfs.github.com/spec/v1
+oid sha256:50afd6a3d26dc6ab323236a90fbecceb23cb8eb9fda0e3a0c7d87b83e42d077d
 size 1017948104
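
The `model.safetensors`, `training_args.bin`, and tfevents entries above are Git LFS pointer files: three `key value` lines giving the spec version, a SHA-256 object id, and the object size in bytes. A minimal sketch of reading such a pointer follows, assuming the three-line layout shown in this diff; `parse_lfs_pointer` is a hypothetical helper and not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# New-side pointer for model.safetensors, copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e02c3ac22ff0d23f01f139c12fab9df94791fcaf3e24d1dc8340b3319a0d1408
size 137033984
"""

info = parse_lfs_pointer(pointer)

if __name__ == "__main__":
    print(info["hash_algo"], info["size"])
```

For `model.safetensors` the size is unchanged (137033984 bytes) while the oid differs, which is what one would expect when weights of the same shape are overwritten with new values.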