lapp0 committed on
Commit 86f60e4
1 Parent(s): 13de31e

End of training

README.md CHANGED
@@ -1,83 +1,235 @@
  ---
- library_name: transformers
- license: apache-2.0
  base_model: HuggingFaceTB/SmolLM-135M
  tags:
  - generated_from_trainer
  model-index:
  - name: distily_smollm_dataset_sweep
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # distily_smollm_dataset_sweep

- This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.2647

- ## Model description

- More information needed

- ## Intended uses & limitations

  More information needed

- ## Training and evaluation data

  More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
  The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 8
- - eval_batch_size: 4
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: polynomial
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1.0
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:------:|:------:|:---------------:|
- | No log | 0 | 0 | 18.8388 |
- | 1.2041 | 0.0401 | 5000 | 1.1584 |
- | 0.7528 | 0.0802 | 10000 | 0.7396 |
- | 0.5961 | 0.1202 | 15000 | 0.6070 |
- | 0.5023 | 0.1603 | 20000 | 0.5307 |
- | 0.4706 | 0.2004 | 25000 | 0.4836 |
- | 0.4605 | 0.2405 | 30000 | 0.4512 |
- | 0.417 | 0.2806 | 35000 | 0.4251 |
- | 0.4027 | 0.3206 | 40000 | 0.4071 |
- | 0.3693 | 0.3607 | 45000 | 0.3898 |
- | 0.3745 | 0.4008 | 50000 | 0.3759 |
- | 0.3652 | 0.4409 | 55000 | 0.3632 |
- | 0.3537 | 0.4810 | 60000 | 0.3529 |
- | 0.3665 | 0.5210 | 65000 | 0.3440 |
- | 0.3177 | 0.5611 | 70000 | 0.3346 |
- | 0.3102 | 0.6012 | 75000 | 0.3269 |
- | 0.3023 | 0.6413 | 80000 | 0.3198 |
- | 0.3076 | 0.6814 | 85000 | 0.3125 |
- | 0.3388 | 0.7214 | 90000 | 0.3062 |
- | 0.298 | 0.7615 | 95000 | 0.3003 |
- | 0.3052 | 0.8016 | 100000 | 0.2941 |
- | 0.2678 | 0.8417 | 105000 | 0.2880 |
- | 0.2684 | 0.8818 | 110000 | 0.2824 |
- | 0.274 | 0.9218 | 115000 | 0.2764 |
- | 0.2647 | 0.9619 | 120000 | 0.2706 |
-
-
- ### Framework versions
  - Transformers 4.45.0.dev0
  - Pytorch 2.5.0.dev20240910+cu121
  - Datasets 2.21.0
- - Tokenizers 0.19.1
 
  ---
  base_model: HuggingFaceTB/SmolLM-135M
+ datasets:
+ - HuggingFaceFW/fineweb
+ library_name: Distily
+ license: creativeml-openrail-m
  tags:
  - generated_from_trainer
+ - Distily
+ base_model_relation: finetune
  model-index:
  - name: distily_smollm_dataset_sweep
    results: []
  ---

+ # Summary

+ Distilled with [Distily](https://github.com/lapp0/distily) library
+ using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
+ on dataset [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
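
A minimal usage sketch follows, assuming the checkpoint is published under the Hub id `lapp0/distily_smollm_dataset_sweep` (inferred from the model name above, not stated in the card); adjust the repo id if it differs.

```python
# Sketch: load the distilled student as an ordinary causal LM.
# The repo id is an assumption inferred from the model name above.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_smollm_dataset_sweep"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Knowledge distillation compresses a teacher model by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```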

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment.

+ # Model description

  More information needed

+ # Intended uses & limitations

  More information needed
+ -->
+
+ # Model Architecture:
+ - **Architecture**: `LlamaForCausalLM`
+ - **Total Parameters**: 81,413,568
+ - **Data Type (dtype)**: torch.float32
+ - **Model Size**: 0.30 GB
+
+ <details>
+ <summary>Student Model Details</summary>
+
+ ```
+ LlamaForCausalLM(
+   (model): LlamaModel(
+     (embed_tokens): Embedding(49152, 576)
+     (layers): ModuleList(
+       (0-14): 15 x LlamaDecoderLayer(
+         (self_attn): LlamaSdpaAttention(
+           (q_proj): Linear(in_features=576, out_features=576, bias=False)
+           (k_proj): Linear(in_features=576, out_features=192, bias=False)
+           (v_proj): Linear(in_features=576, out_features=192, bias=False)
+           (o_proj): Linear(in_features=576, out_features=576, bias=False)
+           (rotary_emb): LlamaRotaryEmbedding()
+         )
+         (mlp): LigerSwiGLUMLP(
+           (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+           (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+           (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+         )
+         (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+         (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+       )
+     )
+     (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+     (rotary_emb): LlamaRotaryEmbedding()
+   )
+   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
+ )
+ ```
+
+ </details>
+ <br/>
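
The parameter count, dtype, and size figures above can be re-derived from any loaded checkpoint; a small sketch (the helper name is illustrative):

```python
# Sketch: recompute "Total Parameters" and the fp32 "Model Size" for a
# loaded model; `model` is any torch.nn.Module (e.g. the student above).
import torch

def describe_model(model: torch.nn.Module) -> None:
    total_params = sum(p.numel() for p in model.parameters())
    size_gb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**3
    dtype = next(model.parameters()).dtype
    print(f"Total Parameters: {total_params:,}")
    print(f"Data Type (dtype): {dtype}")
    print(f"Model Size: {size_gb:.2f} GB")
```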
+
+ # Benchmark Metrics Comparison
+
+ | Metric | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8 | logs/teacher |
+ | :--- | :--- | :--- | :--- | :--- | :--- |
+ | tinyArc.acc_norm,none | 0.303 | 0.295 | 0.26 | 0.302 | 0.37 |
+ | tinyGSM8k.exact_match,flexible-extract | 0.029 | 0.03 | 0.006 | 0.025 | 0.006 |
+ | tinyGSM8k.exact_match,strict-match | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 |
+ | tinyHellaswag.acc_norm,none | 0.341 | 0.281 | 0.3 | 0.327 | 0.452 |
+ | tinyMMLU.acc_norm,none | 0.276 | 0.281 | 0.286 | 0.31 | 0.341 |
+ | tinyTruthfulQA.acc,none | 0.463 | 0.447 | 0.419 | 0.423 | 0.38 |
+ | tinyWinogrande.acc_norm,none | 0.466 | 0.436 | 0.492 | 0.46 | 0.509 |
+
+ # Resource Usage
+
+ - Max Train VRAM Use: 13.1269 GB
+ - Available VRAM: 23.4329 GB
+ - GPUs:
+   - 1x NVIDIA GeForce RTX 4090
+ - CPUs: 64
+ - CPU Memory: 251.7299 GB
+ - CPU Memory Bandwidth: 1600 GB/s
+
+ # Distillation (Teacher -> Student) Architecture Difference:
+
+ - **Architecture**: `LlamaForCausalLM` -> `LlamaForCausalLM`
+ - **Total Parameters**: 134,515,008 -> 81,413,568
+ - **Data Type (dtype)**: torch.float32 -> torch.float32
+ - **Model Size**: 0.25 GB -> 0.30 GB
+
+ <details>
+ <summary>Module Diff Details</summary>
+
+ ```diff
+ --- teacher model modules
+ +++ student model modules
+ @@ -2,7 +2,7 @@
+    (model): LlamaModel(
+      (embed_tokens): Embedding(49152, 576)
+      (layers): ModuleList(
+ -      (0-29): 30 x LlamaDecoderLayer(
+ +      (0-14): 15 x LlamaDecoderLayer(
+        (self_attn): LlamaSdpaAttention(
+          (q_proj): Linear(in_features=576, out_features=576, bias=False)
+          (k_proj): Linear(in_features=576, out_features=192, bias=False)
+ @@ -10,17 +10,16 @@
+          (o_proj): Linear(in_features=576, out_features=576, bias=False)
+          (rotary_emb): LlamaRotaryEmbedding()
+        )
+ -      (mlp): LlamaMLP(
+ +      (mlp): LigerSwiGLUMLP(
+          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+ -        (act_fn): SiLU()
+        )
+ -      (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+ -      (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+ +      (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ +      (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+        )
+      )
+ -    (norm): LlamaRMSNorm((576,), eps=1e-05)
+ +    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+      (rotary_emb): LlamaRotaryEmbedding()
+    )
+    (lm_head): Linear(in_features=576, out_features=49152, bias=False)
+
+ ```
+
+ </details>
+ <br/>
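
The student keeps the teacher's layout but halves the depth (30 -> 15 decoder layers), matching the `student_model_config: {'num_hidden_layers': 15}` setting recorded under Hyperparameters below. A rough sketch of instantiating such a student with `transformers`; this is illustrative and not necessarily Distily's exact construction path (weight copying, reinitialization, and Liger kernel patching are omitted):

```python
# Sketch: derive a half-depth student config from the teacher and build a
# freshly initialized student. Illustrative; Distily's own construction differs.
from transformers import AutoConfig, AutoModelForCausalLM

teacher_id = "HuggingFaceTB/SmolLM-135M"
student_config = AutoConfig.from_pretrained(teacher_id)
student_config.num_hidden_layers = 15  # the teacher uses 30 decoder layers

student = AutoModelForCausalLM.from_config(student_config)
print(sum(p.numel() for p in student.parameters()))  # ~81M with tied embeddings
```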
+
+ # Train Dataset
+ Trained on 501,164,413 tokens from the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
+
+ - Num Samples: `998,000`
+ - Subset: `sample-10BT`
+ - Split: `train`
+
+
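For reference, the subset and split described above can be pulled with the `datasets` library; a small sketch (streaming, to avoid materializing the full `sample-10BT` dump locally):

```python
# Sketch: stream the same fineweb subset/split used for distillation.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)
for example in ds.take(2):
    print(example["text"][:200])  # the card lists `text` as the training column
```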
+ # Training Objective
+
+ ```
+ DistillationObjective(
+     logits_loss_component=LossComponent(
+         weight=1,
+         loss_fn='kl'
+     ),
+     hs_loss_component=LossComponent(
+         weight=0
+     ),
+     attn_loss_component=LossComponent(
+         weight=0
+     )
+ )
+ ```
+
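The objective above places all weight on a KL divergence between teacher and student next-token distributions, with hidden-state and attention losses disabled. A minimal PyTorch sketch of that logits term, as a rough illustration rather than Distily's exact implementation (temperature scaling and padding masks are omitted):

```python
# Sketch: forward KL between teacher and student token distributions.
# Illustrative only; the actual Distily loss may add masking/temperature.
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Both tensors: (batch, seq_len, vocab_size)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    # KL(teacher || student), averaged over the batch dimension
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```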
+ # Hyperparameters
  The following hyperparameters were used during training:

+ <details>
+ <summary>Expand</summary>
+
+ - learning_rate: `0.0001`
+ - train_batch_size: `8`
+ - eval_batch_size: `4`
+ - seed: `42`
+ - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
+ - lr_scheduler_type: `polynomial`
+ - lr_scheduler_warmup_ratio: `0.1`
+ - num_epochs: `1.0`
+ - distillation_objective: `DistillationObjective(
+     logits_loss_component=LossComponent(
+         weight=1,
+         loss_fn='kl'
+     ),
+     hs_loss_component=LossComponent(
+         weight=0
+     ),
+     attn_loss_component=LossComponent(
+         weight=0
+     )
+ )`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7205cc5db070>`
+ - student_model_name_or_path: `None`
+ - student_config_name_or_path: `None`
+ - student_model_config: `{'num_hidden_layers': 15}`
+ - reinitialize_weights: `None`
+ - copy_teacher_modules: `[('lm_head', False)]`
+ - student_model_as_bitnet: `False`
+ - student_use_liger_kernel: `True`
+ - teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
+ - teacher_load_in_8bit: `False`
+ - teacher_load_in_4bit: `False`
+ - dataset_uri: `HuggingFaceFW/fineweb`
+ - dataset_subset: `sample-10BT`
+ - dataset_split: `train`
+ - dataset_column_name: `text`
+ - dataset_sample_size: `1000000`
+ - dataset_max_seq_length: `1024`
+ - dataset_test_size: `0.002`
+ - dataset_shuffle: `False`
+ - dataset_shuffle_seed: `42`
+ - dataset_trust_remote_code: `False`
+ - gradient_accumulation_steps: `1`
+ - weight_decay: `0.0`
+ - max_grad_norm: `1.0`
+ - warmup_ratio: `0.1`
+ - warmup_steps: `0`
+ - gradient_checkpointing: `True`
+
+ </details>
+ <br/>
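
The optimizer and schedule listed above correspond to Adam with a polynomial-decay learning rate and 10% warmup; assuming the stock `transformers` scheduler helper is what backs the `LambdaLR` object recorded in the log, the setup could be reproduced roughly as follows (the step count is derived from 998,000 training samples at batch size 8 for one epoch):

```python
# Sketch: Adam + polynomial decay with 10% warmup, mirroring the values above.
# Assumes the standard transformers helper; the run's actual LambdaLR may differ.
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder for model.parameters()
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

num_training_steps = 998_000 // 8  # 124,750 steps for one epoch
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
```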
+
+ # Framework Versions
+ - Distily 0.5.0
  - Transformers 4.45.0.dev0
  - Pytorch 2.5.0.dev20240910+cu121
  - Datasets 2.21.0
 
benchmarks.shelve.bak CHANGED
@@ -2,3 +2,4 @@
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (512, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
benchmarks.shelve.dat CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
 
benchmarks.shelve.dir CHANGED
@@ -2,3 +2,4 @@
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (512, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
  'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727245509.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:817ca7d0896bd3aeb6fb1bcfd3a6da536f2658490279cbd43e920fbebbfa9838
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727245509.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c773e5e524db38b7421c140c09ff664a7b9123e19eb26cd1eadbe3b1687862e
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727245069.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb5b7721e4ed99470b12fad6a02ed811da840bdb71a83d6efc6eb1d4fb74c02c
+ size 529
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727245509.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:603de8f730ea9dd7ad40f010b156c997fba29094effbafe9d8034cad531eed90
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727245509.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c16ab5b45b959065ee903ba6c05e11f35df910dddfeb6a9224d7b50e5a06822
+ size 562
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
  {
    "version": "1.0",
-   "truncation": {
-     "direction": "Right",
-     "max_length": 1023,
-     "strategy": "LongestFirst",
-     "stride": 0
-   },
-   "padding": {
-     "strategy": "BatchLongest",
-     "direction": "Right",
-     "pad_to_multiple_of": null,
-     "pad_id": 0,
-     "pad_type_id": 0,
-     "pad_token": "<|endoftext|>"
-   },
+   "truncation": null,
+   "padding": null,
    "added_tokens": [
      {
        "id": 0,