End of training
README.md CHANGED
@@ -107,28 +107,6 @@ LlamaForCausalLM(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
           (v_proj): Linear(in_features=576, out_features=192, bias=False)
           (o_proj): Linear(in_features=576, out_features=576, bias=False)
           (rotary_emb): LlamaRotaryEmbedding()
         )
-        (mlp): LlamaMLP(
+        (mlp): LigerSwiGLUMLP(
           (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
           (up_proj): Linear(in_features=576, out_features=1536, bias=False)
           (down_proj): Linear(in_features=1536, out_features=576, bias=False)
-          (act_fn): SiLU()
         )
-        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
       )
     )
-    (norm): LlamaRMSNorm((576,), eps=1e-05)
+    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
     (rotary_emb): LlamaRotaryEmbedding()
   )
   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 )
 ```
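The swap from `LlamaMLP`/`LlamaRMSNorm` to `LigerSwiGLUMLP`/`LigerRMSNorm` in the printout above is what Liger-Kernel's model patching produces. The training script is not part of this commit, so the following is only a hedged sketch of how such a swap is typically applied; the base-model id is a placeholder (the 576-dim, 49,152-vocab shapes match SmolLM-135M, but the exact checkpoint is an assumption):

```python
# Hedged sketch: applying Liger-Kernel module swaps like the ones in the
# diff above. Not taken from this repo's training script.
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Patch the transformers Llama classes in place, before the model is built.
apply_liger_kernel_to_llama(
    swiglu=True,    # LlamaMLP -> LigerSwiGLUMLP (SiLU gating is fused, so no separate act_fn)
    rms_norm=True,  # LlamaRMSNorm -> LigerRMSNorm
)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")  # placeholder id
print(model)  # the repr now shows LigerSwiGLUMLP / LigerRMSNorm, as in the diff
```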
@@ -136,7 +114,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 44,
+Trained on 44,061,015 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
 - Num Samples: `49,900`
 - Subset: `20231101.en`
@@ -185,7 +163,7 @@ The following hyperparameters were used during training:
     weight=0
   )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7c6117e3aad0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
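The `lr_scheduler` entry above is just the object's repr leaking into the hyperparameter dump, so the actual schedule is not recoverable from the README. For orientation only, a generic `LambdaLR` of that kind is built like this; the warmup lambda is illustrative, not the one used in training:

```python
# Hedged sketch: a generic LambdaLR like the one whose repr was logged above.
import torch
from torch.optim.lr_scheduler import LambdaLR

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-3)
# Illustrative linear warmup over 100 steps, then constant; the real lr_lambda is unknown.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 100))
print(scheduler)  # <torch.optim.lr_scheduler.LambdaLR object at 0x...>
```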
logs/attn_weight=0, bf16=True, per_device_train_batch_size=4, run_name=bf16/events.out.tfevents.1726170813.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89b005923b155722294ecdd76a75adae3712e0b8944137742eb912fba4f3e226
+size 249
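The three `+` lines are a Git LFS pointer, not the TensorBoard data itself: only the SHA-256 and size (249 bytes) live in git, and the events file is stored out-of-band. A hedged sketch of resolving it with `huggingface_hub`; the repo id is a placeholder, since the commit view does not name it:

```python
# Hedged sketch: downloading the real events file behind the LFS pointer above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="<user>/<repo>",  # placeholder; not shown in this commit view
    filename=(
        "logs/attn_weight=0, bf16=True, per_device_train_batch_size=4, "
        "run_name=bf16/events.out.tfevents.1726170813.1c1a426a2fee"
    ),
)
print(path)  # local path to the downloaded events file
```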