Why is there no pad token?

#101
by Imran1 - opened

I don't know how this tokenizer is designed; can you explain why there is no pad token?
If I use the eos token as the pad token, the model shows bad behavior in its responses: the responses are really short, just 25 or 30 tokens. However, my dataset has samples ranging from 250 to 1000+ words, i.e. each answer has around 250 to 1000 words or more.

The main issue is how to prepare this data so that all samples in a batch have equal length. I tried to use packing, but it's not working.

Kindly explain this in detail.
Thanks for your time.

Regards,
Imran

The eos_token is wrong; it's actually a pad_token. It was already fixed in the 8B repo, and it will be fixed here soon after, if I am not mistaken. The correct eos_token is <|eot_id|>.

@MaziyarPanahi
So can you show a method for how to prepare data for training?

I mean, which tokens should I choose for pad and eos?

If I add a custom token to the tokenizer and resize the model embeddings, it throws an error after calling trainer.train().

Something like an attention-related error...

I don't see any pad token in the tokenizer config. It only shows the eos token.

I use "<|eot_id|>" for eos and for pad you either introduce a new <pad> token yourself during the training and use that for padding, or you can reuse <|end_of_text|> as a pad. (it's up to you when it comes to Llama-3)

@MaziyarPanahi I really appreciate your work. I have already installed some of your GGUF models. Could you share some examples of your text generation pipeline?

@Imran1 The pad token hasn't been used for years, because it is never used during pre-training. All CausalLM models are trained at the maximum sequence length for computational efficiency.
If you really need one, you can use the eos token and mask it in the labels, or better, concatenate all the samples during fine-tuning using packing from the SFT library (TRL).
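A rough sketch of what packing with TRL's SFTTrainer can look like (argument names such as max_seq_length vary between TRL versions; the dataset file and model id are placeholders):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical dataset with a "text" column holding full prompt+answer strings
dataset = load_dataset("json", data_files="train.jsonl", split="train")

config = SFTConfig(
    output_dir="llama3-sft",
    packing=True,               # concatenate samples into fixed-length blocks
    max_seq_length=2048,        # every block has the same length, so no padding is needed
    per_device_train_batch_size=4,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    train_dataset=dataset,
    args=config,
)
trainer.train()
```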

Please avoid adding tokens, because by default they are initialized with zero embeddings, so the new token becomes the most likely one under the softmax and will always be chosen in the early fine-tuning steps.
If you need some specific token, follow this guide from Stanford: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
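If a new token really is needed, here is a sketch of the mean-initialization trick described in that guide (the model id is a placeholder, and the snippet assumes untied input/output embeddings, so both are handled):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

num_added = tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))

# Initialize the new rows with the mean of the existing embeddings
# instead of leaving them at the default initialization.
with torch.no_grad():
    input_embeddings = model.get_input_embeddings().weight
    output_embeddings = model.get_output_embeddings().weight
    input_embeddings[-num_added:] = input_embeddings[:-num_added].mean(dim=0)
    output_embeddings[-num_added:] = output_embeddings[:-num_added].mean(dim=0)
```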

@matteoperiani Great, and thanks for the response.
I thought so too. Can you share a blog post or something I can read to learn how the new models work, especially for fine-tuning, so I can update my knowledge?


@matteoperiani Thank you, I will check it. I have good experience with model fine-tuning, but I want to learn the new approaches. It will definitely help. Thank you...


@matteoperiani I know LoRA, DoRA, QLoRA, etc. Actually, Llama-3 uses a new tokenizer that has no pad token and adds some other new special tokens; that's what I want to understand.
I want to understand how they work. I should read a few blog posts about this new tokenizer.

Some people use the eos token as the pad token.
As you said, LLMs are trained at the maximum token length, or the TRL library is used with packing.

I have a dataset with 28k samples. I fine-tuned with SFT and the ORPO approach... but the SFT model generates long responses that don't make sense and aren't really related to my domain dataset. It hallucinates easily...

I tried ORPO, which also leads the model to very short responses that don't make sense; sometimes it repeats a sentence and then adds the eos token at the end.
I did some hyperparameter tuning,
like batch size, learning rate, and so on.
But these new models, Llama-3 and Phi-3, are not doing very well.

I haven't found any good blog or article that explains this in depth.

Sorry, I didn’t understand your question before :).

If your LoRA SFT doesn't produce results, I suspect it could be caused by the LoRA rank or alpha. If you haven't already tried it, use a bigger rank, something like 128 or 256. If that doesn't work, try modifying the alpha value; a quite easy rule of thumb is to use alpha = rank as a starting point.
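For reference, a minimal PEFT sketch of bumping the rank while keeping alpha = rank (the model id and target modules are assumptions for a Llama-style checkpoint):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model id
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=128,                  # try 128 or 256 instead of the usual 8-16
    lora_alpha=128,         # rule of thumb: start with alpha = rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```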

If you are using 4-bit quantization, try 8-bit. Theoretically, 8-bit performs like 16-bit, so it will be close to a full-model fine-tune.
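A minimal sketch of loading in 8-bit with bitsandbytes instead of 4-bit (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # instead of load_in_4bit=True

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```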

If both tests produce no results, you could try a smaller model and fully fine-tune it. It will give poor results, but you can tweak the input-output format if you notice strange behavior. A fine-tuned GPT-2 is able to produce output in the shape you want (ignoring how wrong the content is). So if the model is not able to generate the output you desire, try changing the fine-tuning data format.

Last but not least, decoders are very sensitive to generation parameters. Try different configurations with all the models; you may improve the results with only a few lines of code.
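For example, a small experiment over sampling parameters with the transformers pipeline (the model id, prompt, and exact values are placeholders worth tuning per model):

```python
from transformers import pipeline

# Placeholder model id and prompt
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

output = generator(
    "Explain the Llama-3 tokenizer in one paragraph.",
    max_new_tokens=512,       # give the model room for longer answers
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,   # discourages the repeated sentences described above
)
print(output[0]["generated_text"])
```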

As for ORPO, I haven't used it much, so my knowledge would not be helpful.

@matteoperiani I am using A100 GPUs. I can easily use LoRA, DoRA, etc. with multiple GPUs. I tried the Qwen model, and its performance is good.
My dataset looks like this:
instructions, question, rejected answer, and chosen answer.
I generated it synthetically using RAG,
and prepared it for SFT, DPO, and ORPO.

Thank you. You've helped a lot.

Are you using the base model or the instruct/chat fine-tuned one?

@matteoperiani I am using both.

I tried the Qwen LLM with GaLore.

And does your validation loss keep decreasing throughout the fine-tuning, reaching lower values?

@matteoperiani yeah definitely...

@matteoperiani here is the info on the ORPO run...

(attached screenshots: ls.PNG, _ds.PNG)

@Imran1 hi, I'm also using qwen and have problems with padding.
I find that when I use left padding, i.e., padding tokens to the left of the system prompt, the model's generations are greatly altered.
More specifically, the model with padded inputs does not follow the system instructions as well as it does with unpadded inputs.
Have you ever had this problem, or is there maybe something wrong with my code?
Thanks!
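For reference, a sketch of the usual left-padding setup for batched generation with a decoder-only chat model (the Qwen checkpoint name is an assumption); if the attention mask and pad_token_id are passed correctly, left padding should not materially change a greedy decode:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # assumed Qwen variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Decoder-only models should be left-padded for batched generation,
# and the attention mask must be passed so padded positions are ignored.
tokenizer.padding_side = "left"

prompts = [
    "Give me one fun fact about pandas.",
    "Summarize the history of the Roman Empire in two sentences.",
]
chats = [
    tokenizer.apply_chat_template(
        [{"role": "system", "content": "Answer concisely in English."},
         {"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]
inputs = tokenizer(chats, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))
```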


@para-zhou Hey, actually my dataset is not so good. My model's responses are only as good as my dataset's output/response column.

I tried with unpadded tokens and used the eos token as the pad token; with that, the model repeats the answer. Actually, the Qwen model has its own pad token, so try that one, and add the response template in the data collator class, like <|im_start|>assistant.
Make sure to check the documentation; the TRL library is being updated day by day.
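A rough sketch of that suggestion with TRL's DataCollatorForCompletionOnlyLM (API details differ across TRL versions; the dataset file and Qwen checkpoint are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

model_id = "Qwen/Qwen1.5-7B-Chat"  # assumed Qwen variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Qwen ships its own pad token, so there is no need to reuse eos for padding.

# Hypothetical dataset with a "text" column containing ChatML-formatted conversations
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Only compute the loss on the assistant response, not on the prompt.
response_template = "<|im_start|>assistant"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model_id,
    train_dataset=dataset,
    data_collator=collator,
    args=SFTConfig(output_dir="qwen-sft", packing=False),  # packing must stay off with this collator
)
trainer.train()
```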
