Update README.md

jarodrigues committed
Commit d2206f3 • 1 Parent(s): 9f6f81d
README.md CHANGED

@@ -111,7 +111,7 @@ These take the various fields in the dataset and arrange them into prompts, whic
 We applied supervised fine-tuning with a causal language modeling training objective following a zero-out technique during the fine-tuning process.
 Specifically, while the entire prompt received attention during fine-tuning, only the response tokens were subjected to back-propagation.
 
-In terms of hyper-parameters,
+In terms of hyper-parameters, the model was trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, and a two-epoch training regime without warm-up; to ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
 
 Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separate each example individually.
 In other words, each example occupies the full input sequence length.
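
The zero-out technique and the hyper-parameters described in the updated paragraph can be illustrated with a short sketch. The snippet below is illustrative only, not the model's actual training code: it assumes the Hugging Face `transformers` Trainer API and a tokenizer that defines a pad token, and the model id `"base-model"` and the helper `build_example` are hypothetical. Prompt tokens receive the label `-100` (the ignore index of the cross-entropy loss), so the full prompt is attended to but only response tokens are back-propagated; each example is padded individually to the full 512-token input sequence rather than packed; and the training arguments mirror the hyper-parameters listed in the diff.

```python
# Illustrative sketch only: assumes the Hugging Face transformers API.
# "base-model" and build_example are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MAX_LEN = 512  # shorter than the base model's 4096 due to hardware limits

tokenizer = AutoTokenizer.from_pretrained("base-model")  # placeholder model id
model = AutoModelForCausalLM.from_pretrained("base-model")

def build_example(prompt: str, response: str) -> dict:
    """Tokenize one prompt/response pair; each example occupies the full
    512-token input sequence (no packing of multiple examples)."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:MAX_LEN]
    # Zero-out technique: the whole prompt is attended to, but its labels are
    # set to -100, the ignore index of the cross-entropy loss, so only the
    # response tokens are back-propagated.
    labels = ([-100] * len(prompt_ids) + response_ids)[:MAX_LEN]
    attention_mask = [1] * len(input_ids)
    # Pad each example individually to the full input sequence length.
    pad_len = MAX_LEN - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad_len
    labels += [-100] * pad_len           # padding never contributes to the loss
    attention_mask += [0] * pad_len
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}

# Hyper-parameters as listed in the updated README paragraph.
args = TrainingArguments(
    output_dir="sft-out",
    learning_rate=2e-5,
    weight_decay=0.1,
    num_train_epochs=2,
    warmup_steps=0,                      # no warm-up
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., data_collator=...)
# trainer.train()
```

With the numbers given in the diff, every optimizer step covers 16 × 16 = 256 examples of 512 tokens each, i.e. a fixed 131,072 token positions back-propagated per step.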