End of training
- README.md +17 -38
- benchmarks.shelve.bak +1 -0
- benchmarks.shelve.dat +0 -0
- benchmarks.shelve.dir +1 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
- tokenizer.json +2 -14
README.md
CHANGED
@@ -1,7 +1,7 @@
|
|
1 |
---
|
2 |
base_model: HuggingFaceTB/SmolLM-135M
|
3 |
datasets:
|
4 |
-
- HuggingFaceFW/fineweb
|
5 |
library_name: Distily
|
6 |
license: creativeml-openrail-m
|
7 |
tags:
|
@@ -18,7 +18,7 @@ model-index:
|
|
18 |
|
19 |
Distilled with [Distily](https://github.com/lapp0/distily) library
|
20 |
using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
|
21 |
-
on dataset [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
|
22 |
|
23 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
24 |
should probably proofread and complete it, then remove this comment.
|
@@ -80,20 +80,21 @@ LlamaForCausalLM(
|
|
80 |
- student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
|
81 |
- student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
|
82 |
- student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
|
83 |
-
|
84 |
-
|
85 |
-
|
|
86 |
-
|
|
87 |
-
|
|
88 |
-
| tinyGSM8k.exact_match,
|
89 |
-
|
|
90 |
-
|
|
91 |
-
|
|
92 |
-
|
|
|
|
93 |
|
94 |
# Resource Usage
|
95 |
|
96 |
-
- Max Train VRAM Use: 13.
|
97 |
- Available VRAM: 23.4329 GB
|
98 |
- GPUs:
|
99 |
- 1x NVIDIA GeForce RTX 4090
|
@@ -123,28 +124,6 @@ LlamaForCausalLM(
|
|
123 |
(self_attn): LlamaSdpaAttention(
|
124 |
(q_proj): Linear(in_features=576, out_features=576, bias=False)
|
125 |
(k_proj): Linear(in_features=576, out_features=192, bias=False)
|
126 |
-
@@ -10,17 +10,16 @@
|
127 |
-
(o_proj): Linear(in_features=576, out_features=576, bias=False)
|
128 |
-
(rotary_emb): LlamaRotaryEmbedding()
|
129 |
-
)
|
130 |
-
- (mlp): LlamaMLP(
|
131 |
-
+ (mlp): LigerSwiGLUMLP(
|
132 |
-
(gate_proj): Linear(in_features=576, out_features=1536, bias=False)
|
133 |
-
(up_proj): Linear(in_features=576, out_features=1536, bias=False)
|
134 |
-
(down_proj): Linear(in_features=1536, out_features=576, bias=False)
|
135 |
-
- (act_fn): SiLU()
|
136 |
-
)
|
137 |
-
- (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
|
138 |
-
- (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
|
139 |
-
+ (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
|
140 |
-
+ (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
|
141 |
-
)
|
142 |
-
)
|
143 |
-
- (norm): LlamaRMSNorm((576,), eps=1e-05)
|
144 |
-
+ (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
|
145 |
-
(rotary_emb): LlamaRotaryEmbedding()
|
146 |
-
)
|
147 |
-
(lm_head): Linear(in_features=576, out_features=49152, bias=False)
|
148 |
|
149 |
```
|
150 |
|
@@ -152,7 +131,7 @@ LlamaForCausalLM(
|
|
152 |
<br/>
|
153 |
|
154 |
# Train Dataset
|
155 |
-
Trained on
|
156 |
|
157 |
- Num Samples: `998,000`
|
158 |
- Subset: `sample-10BT`
|
@@ -202,7 +181,7 @@ The following hyperparameters were used during training:
|
|
202 |
weight=0
|
203 |
)
|
204 |
)`
|
205 |
-
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
|
206 |
- student_model_name_or_path: `None`
|
207 |
- student_config_name_or_path: `None`
|
208 |
- student_model_config: `{'num_hidden_layers': 15}`
|
@@ -213,7 +192,7 @@ The following hyperparameters were used during training:
|
|
213 |
- teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
|
214 |
- teacher_load_in_8bit: `False`
|
215 |
- teacher_load_in_4bit: `False`
|
216 |
-
- dataset_uri: `HuggingFaceFW/fineweb`
|
217 |
- dataset_subset: `sample-10BT`
|
218 |
- dataset_split: `train`
|
219 |
- dataset_column_name: `text`
|
|
|
1 |
---
|
2 |
base_model: HuggingFaceTB/SmolLM-135M
|
3 |
datasets:
|
4 |
+
- HuggingFaceFW/fineweb-edu
|
5 |
library_name: Distily
|
6 |
license: creativeml-openrail-m
|
7 |
tags:
|
|
|
18 |
|
19 |
Distilled with [Distily](https://github.com/lapp0/distily) library
|
20 |
using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
|
21 |
+
on dataset [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
|
22 |
|
23 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
24 |
should probably proofread and complete it, then remove this comment.
|
|
|
80 |
- student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
|
81 |
- student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
|
82 |
- student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
|
83 |
+
- student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
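The student labels above, which are also used as log directory names, are comma-separated `key=value` pairs. A minimal sketch of splitting one into a dictionary follows; `parse_run_name` is a hypothetical helper for illustration, not part of Distily, and all values are kept as strings.

```python
def parse_run_name(name: str) -> dict[str, str]:
    """Split a 'key=value, key=value' run label into a dict of strings."""
    params = {}
    for part in name.split(", "):
        # partition keeps everything after the first '=' as the value
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return params


# Label copied from the "student 5" entry in this diff.
student_5 = (
    "dataset_max_seq_length=1024, dataset_sample_size=1000000, "
    "dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, "
    "learning_rate=6e-05, per_device_train_batch_size=8"
)

run = parse_run_name(student_5)
print(run["dataset_uri"], run["learning_rate"])
```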
|
84 |
+
|
85 |
+
| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 |
|
86 |
+
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|
87 |
+
| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** |
|
88 |
+
| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 |
|
89 |
+
| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
|
90 |
+
| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 |
|
91 |
+
| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 |
|
92 |
+
| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 |
|
93 |
+
| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 |
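The bold entries in the table appear to mark the best student per metric, with ties all bolded. The sketch below recomputes that from the student columns as transcribed from the table; the teacher column is left out, and the `best_students` helper is illustrative only.

```python
students = [f"student {i}" for i in range(6)]

# Student scores copied from the table above (teacher column omitted).
scores = {
    "tinyArc.acc_norm,none": [0.303, 0.295, 0.302, 0.26, 0.269, 0.319],
    "tinyGSM8k.exact_match,flexible-extract": [0.029, 0.03, 0.025, 0.006, 0.006, 0.012],
    "tinyGSM8k.exact_match,strict-match": [0.006, 0.006, 0.006, 0.006, 0.006, 0.006],
    "tinyHellaswag.acc_norm,none": [0.341, 0.281, 0.327, 0.3, 0.303, 0.301],
    "tinyMMLU.acc_norm,none": [0.276, 0.281, 0.31, 0.286, 0.279, 0.292],
    "tinyTruthfulQA.acc,none": [0.463, 0.447, 0.423, 0.419, 0.421, 0.427],
    "tinyWinogrande.acc_norm,none": [0.466, 0.436, 0.46, 0.492, 0.473, 0.417],
}


def best_students(values: list[float]) -> list[str]:
    """Return every student that reaches the top score (ties included)."""
    top = max(values)
    return [students[i] for i, v in enumerate(values) if v == top]


best = {metric: best_students(vals) for metric, vals in scores.items()}
for metric, names in best.items():
    print(f"{metric}: {', '.join(names)}")
```

No single configuration leads on every metric, and the differences on these small "tiny" benchmarks may be within noise.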
|
94 |
|
95 |
# Resource Usage
|
96 |
|
97 |
+
- Max Train VRAM Use: 13.1273 GB
|
98 |
- Available VRAM: 23.4329 GB
|
99 |
- GPUs:
|
100 |
- 1x NVIDIA GeForce RTX 4090
|
|
|
124 |
(self_attn): LlamaSdpaAttention(
|
125 |
(q_proj): Linear(in_features=576, out_features=576, bias=False)
|
126 |
(k_proj): Linear(in_features=576, out_features=192, bias=False)
|
127 |
|
128 |
```
|
129 |
|
|
|
131 |
<br/>
|
132 |
|
133 |
# Train Dataset
|
134 |
+
Trained on 640,425,804 tokens from the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
|
135 |
|
136 |
- Num Samples: `998,000`
|
137 |
- Subset: `sample-10BT`
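As a rough sanity check on the figures above, the token count and sample count imply an average sample length. This sketch assumes the 640,425,804 tokens are spread over the 998,000 listed samples and that the `dataset_max_seq_length=1024` from the run label is the relevant cap; neither assumption is confirmed by the card.

```python
total_tokens = 640_425_804   # "Trained on ... tokens" from the card
num_samples = 998_000        # "Num Samples" from the card
max_seq_length = 1024        # dataset_max_seq_length from the run label

# Average tokens per sample, and how much of the length cap that fills.
avg_tokens = total_tokens / num_samples
fill_ratio = avg_tokens / max_seq_length

print(f"{avg_tokens:.1f} tokens per sample on average")
print(f"{fill_ratio:.1%} of the maximum sequence length")
```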
|
|
|
181 |
weight=0
|
182 |
)
|
183 |
)`
|
184 |
+
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d824cbaf4f0>`
|
185 |
- student_model_name_or_path: `None`
|
186 |
- student_config_name_or_path: `None`
|
187 |
- student_model_config: `{'num_hidden_layers': 15}`
|
|
|
192 |
- teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
|
193 |
- teacher_load_in_8bit: `False`
|
194 |
- teacher_load_in_4bit: `False`
|
195 |
+
- dataset_uri: `HuggingFaceFW/fineweb-edu`
|
196 |
- dataset_subset: `sample-10BT`
|
197 |
- dataset_split: `train`
|
198 |
- dataset_column_name: `text`
|
benchmarks.shelve.bak
CHANGED
@@ -4,3 +4,4 @@
|
|
4 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
|
5 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
|
6 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
|
|
|
|
4 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
|
5 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
|
6 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
|
7 |
+
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
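These index lines look like the directory file written by Python's `dbm.dumb` backend, which `shelve` can fall back to: a key repr followed by a `(position, size)` pair pointing into the `.dat` file. Assuming that layout, a line can be read back with `ast.literal_eval`; `parse_index_line` is a hypothetical helper for illustration.

```python
import ast


def parse_index_line(line: str) -> tuple[str, int, int]:
    """Parse "'key', (pos, size)" into (key, pos, size)."""
    key, (pos, size) = ast.literal_eval(line)
    return key, pos, size


# The entry added by this commit, copied from the diff.
added = (
    "'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, "
    "dataset_sample_size=1000000, dataset_subset=sample-10BT, "
    "dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, "
    "per_device_train_batch_size=8', (3072, 448)"
)

key, pos, size = parse_index_line(added)
# dbm.dumb aligns records to 512-byte blocks, which matches the
# 1536 / 2048 / 2560 / 3072 offsets seen in this file.
print(pos, size, pos % 512 == 0)
```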
|
benchmarks.shelve.dat
CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
|
|
benchmarks.shelve.dir
CHANGED
@@ -4,3 +4,4 @@
|
|
4 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
|
5 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
|
6 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
|
|
|
|
4 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
|
5 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
|
6 |
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
|
7 |
+
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:d78b57ac043ee94d05e8c1ba184e929678593bf39dee76cc173adacd4357a137
|
3 |
+
size 562
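Each added log file is stored as a Git LFS pointer: three `key value` lines giving the spec version, the object hash, and the real file size in bytes. A minimal sketch of reading one follows; `parse_lfs_pointer` is a hypothetical helper, and it only handles the three fields shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


# Pointer contents copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d78b57ac043ee94d05e8c1ba184e929678593bf39dee76cc173adacd4357a137
size 562
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

At 529 to 562 bytes, these event files likely hold little more than headers, though the diff does not show their contents.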
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:950a2485764d9a8707289ae5e36dcd0f106bad33b5437d5e88753778f1282ab5
|
3 |
+
size 562
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:46f4f0f49ae412d50e473e16ee9ba0d9c9ffba01a96132b9da302e8ed89e83ba
|
3 |
+
size 562
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:61f8d3c58bc2c445f6add695c17231a1d6aa44f075e314f683f07998d6e7603b
|
3 |
+
size 562
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:3950b20d235aab15fd629f63779e509d8fa68a67d64198ccde410d368bab2fa5
|
3 |
+
size 529
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:070cb12f348bade560a253ed036f705331430f20e2c74309b499f372eb402607
|
3 |
+
size 562
|
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee
ADDED
@@ -0,0 +1,3 @@
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:9c61cd09949915a47af4cef46db34250f1ba2e1f1a56dc7b5fa1cc44f21a1eb0
|
3 |
+
size 562
|
tokenizer.json
CHANGED
@@ -1,19 +1,7 @@
|
|
1 |
{
|
2 |
"version": "1.0",
|
3 |
-
"truncation":
|
4 |
-
|
5 |
-
"max_length": 1023,
|
6 |
-
"strategy": "LongestFirst",
|
7 |
-
"stride": 0
|
8 |
-
},
|
9 |
-
"padding": {
|
10 |
-
"strategy": "BatchLongest",
|
11 |
-
"direction": "Right",
|
12 |
-
"pad_to_multiple_of": null,
|
13 |
-
"pad_id": 0,
|
14 |
-
"pad_type_id": 0,
|
15 |
-
"pad_token": "<|endoftext|>"
|
16 |
-
},
|
17 |
"added_tokens": [
|
18 |
{
|
19 |
"id": 0,
|
|
|
1 |
{
|
2 |
"version": "1.0",
|
3 |
+
"truncation": null,
|
4 |
+
"padding": null,
|
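The `tokenizer.json` change replaces the serialized truncation and padding settings with `null`, so the saved tokenizer no longer truncates or pads by default. The sketch below reproduces that edit on an abridged stand-in for the file using only the standard library. `clear_runtime_settings` is a hypothetical helper, the dictionary holds only the fields visible in this diff, and `added_tokens` is an empty placeholder.

```python
import json

# Abridged stand-in for the old tokenizer.json (visible fields only).
old_config = {
    "version": "1.0",
    "truncation": {"max_length": 1023, "strategy": "LongestFirst", "stride": 0},
    "padding": {
        "strategy": "BatchLongest",
        "direction": "Right",
        "pad_to_multiple_of": None,
        "pad_id": 0,
        "pad_type_id": 0,
        "pad_token": "<|endoftext|>",
    },
    "added_tokens": [],  # placeholder; the real list is not shown in the diff
}


def clear_runtime_settings(config: dict) -> dict:
    """Return a copy with truncation and padding reset to None (JSON null)."""
    cleaned = dict(config)
    cleaned["truncation"] = None
    cleaned["padding"] = None
    return cleaned


new_config = clear_runtime_settings(old_config)
print(json.dumps(new_config, indent=2))
```

With the `tokenizers` library, calling `no_truncation()` and `no_padding()` on a loaded `Tokenizer` before saving should have the same effect, though the diff does not show how this change was produced.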
5 |
"added_tokens": [
|
6 |
{
|
7 |
"id": 0,
|