Recommended parameters for the ConstantLengthDataset

#89
by rachelshalom - opened

I'm fine-tuning the model on 2500 YAML files. I want the model to be able to generate this specific type of YAML file.
I checked the distribution of the number of tokens per file, and the majority are around 500 tokens per file.
I did this because I want to work out the optimal sequence length,
so I chose a seq length of 512. Does that basically mean that smaller files will be removed and larger files will be sliced with an EOS token, correct?
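
For context, here is a minimal sketch of how packing into constant-length sequences is commonly done, which is my understanding of the general idea behind a constant-length dataset. It is not the actual ConstantLengthDataset source: `pack_sequences`, the toy token IDs, and the choice to drop the trailing partial chunk are all assumptions made for illustration. In this scheme, files are concatenated with an EOS token after each one, and the resulting stream is cut into chunks of `seq_length` tokens.

```python
from typing import Iterable, List


def pack_sequences(
    tokenized_files: Iterable[List[int]],
    seq_length: int,
    eos_token_id: int,
) -> List[List[int]]:
    """Hypothetical helper: concatenate files with EOS separators, then slice."""
    # Build one long token stream: file tokens, then EOS, then the next file.
    stream: List[int] = []
    for token_ids in tokenized_files:
        stream.extend(token_ids)
        stream.append(eos_token_id)

    # Cut the stream into fixed-size chunks; a trailing chunk shorter than
    # seq_length is dropped here (an assumption of this sketch).
    chunks: List[List[int]] = []
    for start in range(0, len(stream), seq_length):
        chunk = stream[start:start + seq_length]
        if len(chunk) == seq_length:
            chunks.append(chunk)
    return chunks


# Toy example: three "files" of 3, 7 and 2 tokens, with EOS id 0.
files = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
chunks = pack_sequences(files, seq_length=4, eos_token_id=0)
for chunk in chunks:
    print(chunk)
```

If the real dataset behaves like this sketch, short files would not be removed; they would be packed together with their neighbours, and long files would be split across chunk boundaries. Whether that holds for a given library version is worth confirming against its source.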
