nicholasKluge committed
Commit eb32620
1 Parent(s): 36f7a59

Update README.md

Files changed (1)
  1. README.md +27 -31
README.md CHANGED
@@ -66,37 +66,33 @@ This repository has the [source code](https://github.com/Nkluge-correa/Aira) use
 
  ## Training Set-up
 
- | Section        | Setting                     | Value                                |
- |----------------|-----------------------------|--------------------------------------|
- | Model args.    | vocab_size                  | 32000                                |
- |                | hidden_size                 | 768                                  |
- |                | intermediate_size           | 3072                                 |
- |                | max_position_embeddings     | 2048                                 |
- |                | num_attention_heads         | 12                                   |
- |                | num_hidden_layers           | 12                                   |
- |                | num_key_value_heads         | 12                                   |
- |                | torch_dtype                 | "float32"                            |
- | Data args.     | dataset_name                | "nicholasKluge/portuguese-corpus-v3" |
- |                | dataset_split               | "train"                              |
- |                | train_num_samples           | 1831873                              |
- |                | val_num_samples             | 18000                                |
- |                | block_size                  | 2048                                 |
- | Training args. | evaluation_strategy         | "steps"                              |
- |                | eval_steps                  | 100000                               |
- |                | per_device_train_batch_size | 4                                    |
- |                | per_device_eval_batch_size  | 4                                    |
- |                | gradient_accumulation_steps | 1                                    |
- |                | learning_rate               | 0.0006                               |
- |                | adam_epsilon                | 0.00000001                           |
- |                | weight_decay                | 0.01                                 |
- |                | lr_scheduler_type           | "cosine"                             |
- |                | warmup_ratio                | 0.01                                 |
- |                | num_train_epochs            | 1                                    |
- |                | gradient_checkpointing      | false                                |
- |                | seed                        | 42                                   |
- |                | mixed_precision             | 'no'                                 |
- |                | checkpointing_steps         | 22000                                |
- |                | tf32                        | true                                 |
+ | Arguments                   | Value                                |
+ |-----------------------------|--------------------------------------|
+ | vocabulary size             | 32000                                |
+ | hidden dimension size       | 768                                  |
+ | intermediate dimension size | 3072                                 |
+ | context length              | 2048                                 |
+ | nº attention heads          | 12                                   |
+ | nº hidden layers            | 12                                   |
+ | nº key value heads          | 12                                   |
+ | nº training samples         | 1831873                              |
+ | validation samples          | 18000                                |
+ | nº epochs                   | 1                                    |
+ | evaluation steps            | 100000                               |
+ | train batch size            | 4                                    |
+ | eval batch size             | 4                                    |
+ | gradient accumulation steps | 1                                    |
+ | learning rate               | 0.0006                               |
+ | adam epsilon                | 0.00000001                           |
+ | weight decay                | 0.01                                 |
+ | scheduler type              | "cosine"                             |
+ | warmup ratio                | 0.01                                 |
+ | gradient checkpointing      | false                                |
+ | seed                        | 42                                   |
+ | mixed precision             | 'no'                                 |
+ | torch dtype                 | "float32"                            |
+ | tf32                        | true                                 |
+
 
  ## Basic usage
 
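
For reference, here is a minimal sketch of how the arguments in the updated table could be expressed with Hugging Face `transformers`. The use of `LlamaConfig`, `TrainingArguments`, and the `output_dir` value are assumptions made for illustration only; the actual training script (which uses Accelerate-style options such as `mixed_precision` and `checkpointing_steps`) lives in the linked Aira repository.

```python
# Sketch of the table above as Transformers objects. LlamaConfig and
# TrainingArguments are assumptions for illustration, not the repository's
# own training code; mixed_precision and checkpointing_steps are
# Accelerate-style options and are therefore not shown here.
from transformers import LlamaConfig, TrainingArguments

model_config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=2048,   # context length
    num_attention_heads=12,
    num_hidden_layers=12,
    num_key_value_heads=12,
    torch_dtype="float32",
)

training_args = TrainingArguments(
    output_dir="checkpoints",       # hypothetical path, not given in the table
    evaluation_strategy="steps",
    eval_steps=100_000,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=6e-4,
    adam_epsilon=1e-8,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    num_train_epochs=1,
    gradient_checkpointing=False,
    seed=42,
    tf32=True,
)
```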