lapp0 committed on
Commit 88c3f06
1 Parent(s): 3a7d3a6

End of training

README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 452.9807
- - eval_frwikippl: 741.6703
- - eval_zhwikippl: 169.7969
- - eval_tinystoriesppl: 694.5760
- - eval_loss: 1.2502
- - eval_runtime: 21.1964
- - eval_samples_per_second: 47.178
- - eval_steps_per_second: 11.794
+ - eval_enwikippl: 401.6902
+ - eval_frwikippl: 385.9396
+ - eval_zhwikippl: 137.9653
+ - eval_tinystoriesppl: 881.4292
+ - eval_loss: 0.7112
+ - eval_runtime: 21.2483
+ - eval_samples_per_second: 47.063
+ - eval_steps_per_second: 11.766
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -48,7 +48,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 1e-05
+ - learning_rate: 4e-05
 - train_batch_size: 1
 - eval_batch_size: 4
 - seed: 42
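The distillation_objective in the hunk above enables only the logits component (weight=1, loss_fn=kl); the hidden-state (hs) and attention (attn) components carry weight=0 and are inactive. As a rough illustration only, a logits-only KL objective of this shape might look like the PyTorch sketch below; this is not Distily's implementation, and the function name is hypothetical.

```python
import torch.nn.functional as F
from torch import Tensor

def logits_kl_loss(student_logits: Tensor, teacher_logits: Tensor) -> Tensor:
    """Hypothetical sketch of a logits-only KL distillation loss.

    Mirrors the configuration above: weight=1 on the logits component,
    while the hs and attn components have weight=0 and are omitted.
    """
    # KL(teacher || student): summed over sequence positions and vocabulary,
    # averaged over the batch dimension ("batchmean").
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),  # student log-probabilities
        F.log_softmax(teacher_logits, dim=-1),  # teacher log-probabilities
        log_target=True,
        reduction="batchmean",
    )
```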
@@ -63,27 +63,27 @@ Peak GPU Memory: 3.9285 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 270.2348 | 76.8142 | | | | | 671.1238 | 22.8030 |
- | 0 | 0 | 120078.375 | 1867851235328.0 | 18.7920 | 21.1643 | 47.249 | 11.812 | 72.8770 | 4013754155008.0 |
- | 5000 | 0.0505 | 399.8896 | 1364.9200 | 1.5750 | 21.223 | 47.119 | 11.78 | 430.3431 | 486.9932 |
- | 10000 | 0.1010 | 366.0540 | 968.1008 | 1.4975 | 21.2542 | 47.05 | 11.762 | 410.0440 | 300.9413 |
- | 15000 | 0.1515 | 382.6534 | 990.8644 | 1.4377 | 21.1883 | 47.196 | 11.799 | 455.3961 | 243.5069 |
- | 20000 | 0.2020 | 372.0864 | 985.5745 | 1.4590 | 21.2537 | 47.051 | 11.763 | 430.2186 | 317.8063 |
- | 25000 | 0.2525 | 459.8662 | 802.9102 | 1.3109 | 21.2174 | 47.131 | 11.783 | 674.2657 | 183.8540 |
- | 30000 | 0.3030 | 452.4371 | 822.7448 | 1.2777 | 21.2291 | 47.105 | 11.776 | 674.3492 | 162.7067 |
- | 35000 | 0.3535 | 476.7241 | 805.2602 | 1.2741 | 21.2169 | 47.132 | 11.783 | 736.0758 | 174.6150 |
- | 40000 | 0.4040 | 453.2438 | 770.2305 | 1.2733 | 21.1947 | 47.181 | 11.795 | 679.9471 | 163.0870 |
- | 45000 | 0.4545 | 460.5169 | 781.4591 | 1.2687 | 21.2116 | 47.144 | 11.786 | 700.2546 | 183.2052 |
- | 50000 | 0.5051 | 479.0564 | 794.0530 | 1.2632 | 21.229 | 47.105 | 11.776 | 743.4755 | 181.4419 |
- | 55000 | 0.5556 | 471.3993 | 748.4656 | 1.2630 | 21.215 | 47.137 | 11.784 | 731.375 | 172.6117 |
- | 60000 | 0.6061 | 446.4142 | 775.7834 | 1.2687 | 21.1528 | 47.275 | 11.819 | 669.1851 | 164.7928 |
- | 65000 | 0.6566 | 455.8672 | 744.0773 | 1.2538 | 21.2207 | 47.124 | 11.781 | 698.6068 | 164.4469 |
- | 70000 | 0.7071 | 453.5074 | 740.2094 | 1.2513 | 21.3501 | 46.838 | 11.71 | 697.8277 | 168.6457 |
- | 75000 | 0.7576 | 450.4874 | 723.2042 | 1.2535 | 21.2028 | 47.164 | 11.791 | 685.8463 | 167.9272 |
- | 80000 | 0.8081 | 455.6377 | 745.9662 | 1.2523 | 21.2178 | 47.13 | 11.783 | 701.7324 | 170.4892 |
- | 85000 | 0.8586 | 447.3922 | 746.4918 | 1.2509 | 21.2165 | 47.133 | 11.783 | 681.8325 | 168.7976 |
- | 90000 | 0.9091 | 453.0859 | 740.9397 | 1.2505 | 21.1987 | 47.173 | 11.793 | 696.0992 | 169.7290 |
- | 95000 | 0.9596 | 451.3083 | 741.0439 | 1.2504 | 21.5668 | 46.368 | 11.592 | 690.2544 | 169.7969 |
- | 99000 | 1.0 | 452.9807 | 741.6703 | 1.2502 | 21.1964 | 47.178 | 11.794 | 694.5760 | 169.7969 |
+ | 0 | 0 | 120078.375 | 1867851235328.0 | 18.7920 | 21.2125 | 47.142 | 11.786 | 72.8770 | 4013754155008.0 |
+ | 5000 | 0.0505 | 621.5149 | 991.7020 | 1.3528 | 21.2177 | 47.13 | 11.783 | 980.0922 | 399.9691 |
+ | 10000 | 0.1010 | 574.4407 | 664.8521 | 1.1590 | 21.2225 | 47.12 | 11.78 | 1036.6780 | 493.8460 |
+ | 15000 | 0.1515 | 543.0890 | 635.0353 | 1.0360 | 21.2351 | 47.092 | 11.773 | 1033.2988 | 145.9157 |
+ | 20000 | 0.2020 | 509.8121 | 599.6746 | 0.9759 | 21.2099 | 47.148 | 11.787 | 985.1690 | 251.1274 |
+ | 25000 | 0.2525 | 448.2854 | 486.9003 | 0.8334 | 21.2284 | 47.107 | 11.777 | 923.3450 | 171.9567 |
+ | 30000 | 0.3030 | 420.2149 | 441.8981 | 0.7741 | 21.2742 | 47.005 | 11.751 | 893.9037 | 129.4944 |
+ | 35000 | 0.3535 | 417.6187 | 442.7548 | 0.7695 | 21.5924 | 46.313 | 11.578 | 884.2755 | 140.6411 |
+ | 40000 | 0.4040 | 419.8570 | 418.2776 | 0.7678 | 21.23 | 47.103 | 11.776 | 893.9774 | 162.6632 |
+ | 45000 | 0.4545 | 420.1905 | 413.8966 | 0.7576 | 21.2355 | 47.091 | 11.773 | 905.9177 | 154.8089 |
+ | 50000 | 0.5051 | 420.9561 | 426.7430 | 0.7544 | 21.2196 | 47.126 | 11.782 | 906.1800 | 147.5501 |
+ | 55000 | 0.5556 | 417.3034 | 409.1867 | 0.7509 | 21.2021 | 47.165 | 11.791 | 902.3304 | 143.7327 |
+ | 60000 | 0.6061 | 418.3230 | 413.0230 | 0.7525 | 21.2367 | 47.088 | 11.772 | 894.0145 | 156.6996 |
+ | 65000 | 0.6566 | 404.0308 | 404.5305 | 0.7221 | 21.2003 | 47.169 | 11.792 | 878.4468 | 136.2006 |
+ | 70000 | 0.7071 | 406.0154 | 392.1317 | 0.7194 | 21.2119 | 47.143 | 11.786 | 891.9106 | 137.0481 |
+ | 75000 | 0.7576 | 400.8665 | 383.9604 | 0.7188 | 21.2118 | 47.144 | 11.786 | 871.7914 | 140.4630 |
+ | 80000 | 0.8081 | 402.5625 | 387.4647 | 0.7168 | 21.2234 | 47.118 | 11.779 | 882.3771 | 141.0827 |
+ | 85000 | 0.8586 | 399.3479 | 385.9124 | 0.7123 | 21.2047 | 47.159 | 11.79 | 875.1130 | 140.0700 |
+ | 90000 | 0.9091 | 401.2549 | 386.7830 | 0.7117 | 21.2316 | 47.1 | 11.775 | 881.0649 | 138.5555 |
+ | 95000 | 0.9596 | 401.4725 | 386.1842 | 0.7112 | 21.2217 | 47.122 | 11.78 | 880.2640 | 138.0389 |
+ | 99000 | 1.0 | 401.6902 | 385.9396 | 0.7112 | 21.2483 | 47.063 | 11.766 | 881.4292 | 137.9653 |
 
 ### Framework versions
 - Distily 0.2.0
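For context on the table above: the enwikippl, frwikippl, zhwikippl, and tinystoriesppl columns are perplexities on the respective corpora, while loss is the logits KL distillation objective rather than a cross-entropy, so perplexity is not simply exp(loss). A minimal sketch of computing a perplexity for the student checkpoint with transformers follows; the repo id and sample text are placeholders, and this card does not specify the exact evaluation windowing.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual student checkpoint.
model_id = "lapp0/distily-student-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the student model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy
        # over predicted tokens; perplexity is its exponential.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())
```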
 
logs/copy_teacher_modules=_(_lm_head___False)_, learning_rate=4e-05/events.out.tfevents.1724055248.f383272e719b ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9b0eafccae2b1f019f22949ebd8095fe6f6385a605f0c00baea1587be79ab771
+ size 312