weqweasdas committed
Commit 23cf908 • Parent(s): 7ecd22c
Update README.md
README.md CHANGED
@@ -26,9 +26,9 @@ Then, we split the dataset as follows:
 
 ### Training
 
-To use the data more efficiently, we concatenate texts
+To use the data more efficiently, we concatenate texts and split them into 1024-token chunks, rather than padding each batch to its longest text. We then finetune the base model on the SFT dataset for two epochs, using a learning rate of 2e-5 and a linear decay schedule.
 
-We conduct reward modeling with learning rate 5e-6 for 1 epoch and linear decay schedule because it seems that the model easily overfits with more than 1 epoches. We discard the samples longer than 512 tokens so we have approximately
+We conduct reward modeling with a learning rate of 5e-6 for one epoch and a linear decay schedule, since the model easily overfits when trained for more than one epoch. We discard samples longer than 512 tokens, leaving approximately 106K samples in the training set and 5K samples in the test set for reward modeling.
 
 We use bf16 and do not use LoRA in both stages.
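For reference, the packing step described in the updated paragraph can be sketched as follows. This is a minimal sketch, assuming a Hugging Face tokenizer and a `text` column; `pack_texts` and `BLOCK_SIZE` are illustrative names, not taken from this repo:

```python
from itertools import chain

BLOCK_SIZE = 1024  # chunk length stated in the README

def pack_texts(batch, tokenizer):
    """Tokenize a batch, concatenate all token ids into one stream, and
    regroup the stream into fixed BLOCK_SIZE chunks instead of padding
    each batch to its longest text."""
    ids = tokenizer(batch["text"])["input_ids"]
    stream = list(chain.from_iterable(ids))           # one long token stream
    total = (len(stream) // BLOCK_SIZE) * BLOCK_SIZE  # drop the ragged tail
    chunks = [stream[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    # For causal-LM finetuning the labels are the inputs themselves.
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}
```

With `datasets`, this would typically be applied via `dataset.map(pack_texts, batched=True, remove_columns=dataset.column_names, fn_kwargs={"tokenizer": tokenizer})`, so ragged examples are merged rather than padded.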
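The 512-token cutoff for reward modeling could be implemented as a simple filter. The `chosen`/`rejected` column names are an assumption about the preference data, not confirmed by the diff:

```python
def within_limit(example, tokenizer, max_len=512):
    """Keep a preference pair only if both sides fit in max_len tokens."""
    return all(
        len(tokenizer(example[key])["input_ids"]) <= max_len
        for key in ("chosen", "rejected")  # assumed column names
    )

# e.g. train_set = train_set.filter(within_limit, fn_kwargs={"tokenizer": tokenizer})
# After this filter the README reports ~106K train / ~5K test samples.
```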
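The stated hyperparameters map roughly onto standard `transformers` training arguments as sketched below; the output paths are placeholders and the trainer wiring is omitted:

```python
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="sft-out",        # placeholder path
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",  # linear decay schedule
    bf16=True,                   # bf16 in both stages, full finetuning (no LoRA)
)

rm_args = TrainingArguments(
    output_dir="rm-out",         # placeholder path
    num_train_epochs=1,          # more than one epoch overfits, per the README
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    bf16=True,
)
```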