EOS Token is different in NuExtract and Phi-3

#8
by apolo - opened

Hello everyone.

I have extended the fine-tuning using QLoRA, and I realized that when creating the prompt, the EOS token used by the (Phi-3) tokenizer is "<|endoftext|>":

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
print(tokenizer.eos_token)
# <|endoftext|>

However, the prompt template uses <|end-output|>.

How should I prepare the prompt for fine-tuning?

nuextract_template="""<|input|>\n### Template:\n{}\n### Text:\n{}\n### Output:\n<|output|>\n{{"people":{}}}\n<|end-output|>\n"""

or

nuextract_template="""<|input|>\n### Template:\n{}\n### Text:\n{}\n### Output:\n<|output|>\n{{"people":{}}}\n<|endoftext|>\n"""
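
For reference, here is a quick way to check how the tokenizer treats each candidate end token (a minimal sketch; I have not verified the exact ids, which depend on the checkpoint's added special tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")

# See whether each candidate end-of-sequence marker encodes to a single token id
for marker in ("<|end-output|>", "<|endoftext|>"):
    ids = tokenizer.encode(marker, add_special_tokens=False)
    print(marker, ids, tokenizer.convert_ids_to_tokens(ids))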

Thanks!

NuMind org

Hi Apolo,

We recommend using "<|end-output|>", which operates as an EOS token (see generation_config.json).
I also recommend putting a space directly before this token in your prompts, because the tokenizer can sometimes merge it with other tokens depending on what comes before it.
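
Roughly, this amounts to something like the following sketch (the GenerationConfig access is the standard transformers API; the template, text, and output values are placeholders):

from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
gen_config = GenerationConfig.from_pretrained("numind/NuExtract")

# generation_config.json may register one or several EOS ids
eos_ids = gen_config.eos_token_id
eos_ids = eos_ids if isinstance(eos_ids, list) else [eos_ids]
print(tokenizer.convert_ids_to_tokens(eos_ids))

# Fine-tuning example, ending with a space followed by <|end-output|>
# so the tokenizer does not merge the token with what precedes it
template = '{"people": []}'        # placeholder schema
text = "Some input document..."    # placeholder input text
answer = '{"people": []}'          # placeholder gold output
prompt = (
    f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n"
    f"### Output:\n<|output|>\n{answer} <|end-output|>\n"
)
print(prompt)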

Great, I will test it.
Thank you!

apolo changed discussion status to closed
