lapp0 committed · verified
Commit e137ab6 · 1 Parent(s): 7273e96

End of training

README.md CHANGED
@@ -107,28 +107,6 @@ LlamaForCausalLM(
      (self_attn): LlamaSdpaAttention(
        (q_proj): Linear(in_features=576, out_features=576, bias=False)
        (k_proj): Linear(in_features=576, out_features=192, bias=False)
- @@ -10,17 +10,16 @@
-       (o_proj): Linear(in_features=576, out_features=576, bias=False)
-       (rotary_emb): LlamaRotaryEmbedding()
-     )
- -   (mlp): LlamaMLP(
- +   (mlp): LigerSwiGLUMLP(
-       (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
-       (up_proj): Linear(in_features=576, out_features=1536, bias=False)
-       (down_proj): Linear(in_features=1536, out_features=576, bias=False)
- -     (act_fn): SiLU()
-     )
- -   (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
- -   (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
- +   (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
- +   (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-     )
-   )
- - (norm): LlamaRMSNorm((576,), eps=1e-05)
- + (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-   (rotary_emb): LlamaRotaryEmbedding()
- )
- (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
  ```
 
@@ -136,7 +114,7 @@ LlamaForCausalLM(
  <br/>
 
  # Train Dataset
- Trained on 687,245,234 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
+ Trained on 687,248,443 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
 
  - Num Samples: `1,996,000`
  - Subset: `20231101.en`
@@ -166,12 +144,13 @@ The following hyperparameters were used during training:
  <details>
  <summary>Expand</summary>
 
- - learning_rate: `0.0001`
+ - learning_rate: `0.0002`
  - train_batch_size: `16`
  - eval_batch_size: `2`
  - seed: `42`
  - optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
  - lr_scheduler_type: `polynomial`
+ - lr_scheduler_warmup_ratio: `0.1`
  - num_epochs: `1.0`
  - distillation_objective: `DistillationObjective(
      logits_loss_component=LossComponent(
@@ -185,7 +164,7 @@ The following hyperparameters were used during training:
        weight=0
      )
  )`
- - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x76c9e8244e20>`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x76ca0d527850>`
  - student_model_name_or_path: `None`
  - student_config_name_or_path: `None`
  - student_model_config: `{'num_hidden_layers': 15}`
@@ -209,7 +188,7 @@ The following hyperparameters were used during training:
  - gradient_accumulation_steps: `1`
  - weight_decay: `0.0`
  - max_grad_norm: `1.0`
- - warmup_ratio: `0.0`
+ - warmup_ratio: `0.1`
  - warmup_steps: `0`
  - gradient_checkpointing: `True`
 
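For orientation, the hyperparameter changes in this commit (learning_rate `0.0001` → `0.0002`, warmup_ratio `0.0` → `0.1`, polynomial schedule) map onto standard `transformers.TrainingArguments` roughly as sketched below. This is illustrative only: the actual run uses a custom distillation trainer with the `DistillationObjective` listed above, so the field names here are assumptions, not the repository's training script.

```python
# Minimal sketch, assuming a standard transformers Trainer setup;
# the real run wraps these values in a distillation-specific trainer.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distill-student",          # placeholder output path
    learning_rate=2e-4,                    # 0.0001 -> 0.0002 in this commit
    per_device_train_batch_size=16,
    per_device_eval_batch_size=2,
    seed=42,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,                      # 0.0 -> 0.1 in this commit
    num_train_epochs=1.0,
    weight_decay=0.0,
    max_grad_norm=1.0,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
)
```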
logs/dataset_max_seq_length=512, dataset_sample_size=2000000, per_device_train_batch_size=16, warmup_ratio=0.1/events.out.tfevents.1726422177.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f67b97d9b0dce17aa6806d6ece3a1b8db3aaccd655ecf36fb23a26bdfc8ed070
+ size 529
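The added file is a Git LFS pointer to a TensorBoard event log, so only the oid/size stub lives in the diff. A sketch of one way to fetch and inspect it is shown below; the repo id is a placeholder, while the filename is the path added in this commit.

```python
# Sketch only (not part of the commit): download the LFS-backed tfevents file
# from the Hub and print its scalar metrics with TensorBoard's reader.
from huggingface_hub import hf_hub_download
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

event_file = hf_hub_download(
    repo_id="lapp0/<this-model-repo>",  # placeholder: use the actual model repo id
    filename=(
        "logs/dataset_max_seq_length=512, dataset_sample_size=2000000, "
        "per_device_train_batch_size=16, warmup_ratio=0.1/"
        "events.out.tfevents.1726422177.1c1a426a2fee"
    ),
)

acc = EventAccumulator(event_file)
acc.Reload()
for tag in acc.Tags()["scalars"]:
    # print the first few logged values for each scalar metric
    print(tag, [s.value for s in acc.Scalars(tag)][:5])
```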