EOS Token is different in NuExtract and Phi-3
#8
by apolo - opened
Hello everyone.
I have extended the fine-tuning using QLoRA, and I realized that when creating the prompt, the EOS token used by the tokenizer (Phi-3) is "<|endoftext|>":
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
tokenizer.eos_token
## <|endoftext|>
However, you have "<|end-output|>" in the template.
How should I prepare the prompt for fine-tuning?
nunextract_template="""<|input|>\n### Template:\n{}\n### Text:\n{}\n### Output:\n<|output|>\n{{"people":{}}}\n<|end-output|>\n"""
or
nunextract_template="""<|input|>\n### Template:\n{}\n### Text:\n{}\n### Output:\n<|output|>\n{{"people":{}}}\n<|endoftext|>\n"""
Thanks!
Hi Apolo,
We recommend using "<|end-output|>", which operates as an EOS token (see generation_config.json).
I also recommend putting a space directly before this token in your prompts because the tokenizer can sometimes merge it with other tokens, depending on what comes in front of it.
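As a quick sanity check, the two points above can be verified directly: the tokenizer's own eos_token versus the EOS id(s) declared in generation_config.json, and how "<|end-output|>" is split with and without a preceding space. This is a minimal sketch, assuming network access to the Hugging Face Hub; the exact token splits it prints depend on the tokenizer's vocabulary.

```python
# Sketch: compare the tokenizer's eos_token with the EOS configured for
# generation, and check whether "<|end-output|>" stays intact when glued
# to preceding text. Assumes the "numind/NuExtract" repo is reachable.
from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract")
gen_cfg = GenerationConfig.from_pretrained("numind/NuExtract")

# The tokenizer's eos_token is inherited from Phi-3:
print("tokenizer eos_token:", tokenizer.eos_token)

# generation_config.json can declare a different (or additional) EOS id,
# which is what stops generation in practice:
print("generation eos_token_id:", gen_cfg.eos_token_id)

# Without a leading space, the tokenizer may merge "<|end-output|>" with
# the preceding characters into different token pieces:
print("no space:  ", tokenizer.tokenize('}}<|end-output|>'))
print("with space:", tokenizer.tokenize('}} <|end-output|>'))
```

If the two tokenize calls produce different pieces around the closing braces, that is the merging behavior the space is meant to avoid.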
Great, I will test it.
Thank you!
apolo changed discussion status to closed