What pad token should I use for fine-tuning?

#10
by faizsameerahmed96 - opened

I am working on fine-tuning this model. For Phi-3-mini, we were recommended to use the unk_token for padding. However, it looks like this token is no longer present in the tokenizer.

Microsoft org

Hi,
You can try using the <|dummy_id_0|> token. Specifically, when creating the tokenizer, you can do something like:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-small-8k-instruct", trust_remote_code=True, pad_token="<|dummy_id_0|>")
# tokenizer.pad_token should give <|dummy_id_0|>, and tokenizer.pad_token_id should give 100264
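For context on why the pad token matters during fine-tuning: padded positions are normally excluded from the loss by setting their label to -100 (the ignore index used by cross-entropy in most training setups). A minimal sketch of that masking step, using the pad token id 100264 mentioned above (the helper name is hypothetical, not part of any library):

```python
PAD_TOKEN_ID = 100264  # id of <|dummy_id_0|> per the reply above

def mask_pad_labels(input_ids, pad_token_id=PAD_TOKEN_ID):
    # Replace pad positions with -100 so the loss function ignores them;
    # real token ids pass through unchanged.
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

# Example: a sequence right-padded with two pad tokens
batch = [1, 2, 3, PAD_TOKEN_ID, PAD_TOKEN_ID]
labels = mask_pad_labels(batch)
# labels -> [1, 2, 3, -100, -100]
```

Trainer utilities such as DataCollatorForLanguageModeling apply the same masking automatically once tokenizer.pad_token is set.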
bapatra changed discussion status to closed
