Questions about the BOS, EOS, and UNK/PAD tokenizer changes

#1 opened by flyingkiwiguy

In the model card you mention:

"The included tokenizer is based on that of the baseline model, however the BOS, EOS, and UNK/PAD tokens are distinctly defined, which was not the case with the baseline"

Can you explain a bit more about the motivation behind this change to the tokenizer? I notice that a lot of llama-cpp fine-tuning uses tokenizers with the `add_eos_token=True` flag set.
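
For context, here's roughly what that flag does on a LLaMA-style tokenizer (a sketch; the checkpoint name is just a placeholder):

```python
# Sketch: effect of add_eos_token=True on a LLaMA-style tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_3b",  # placeholder; use your base model
    add_eos_token=True,               # append the EOS token to every encoding
)
ids = tok("Hello world")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# e.g. ['<s>', '▁Hello', '▁world', '</s>']
```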

StarFish Medical ML/AI Lab org

The choice was mostly for fine-tuning, which requires a pad token. I haven't noticed any issues with text generation, so I haven't changed it.
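
For anyone reproducing this, a rough sketch of defining a distinct pad token before fine-tuning (the names are placeholders, not the exact code used for this model):

```python
# Sketch: give the tokenizer a distinct pad token for batched fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-base-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained("your-base-model")

if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": "<pad>"})  # distinct from EOS/UNK
    model.resize_token_embeddings(len(tok))         # match the grown vocab
```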

The EOS (`</s>`) token will prevent run-on generation, and since I brace my fine-tuning data with `<s>...</s>`, the model is trained to end appropriately.
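
Roughly, the bracing looks like this (assuming the standard LLaMA `<s>`/`</s>` special tokens; adjust if your tokenizer differs):

```python
# Sketch: brace each training example with BOS/EOS so the model
# learns where a completion should end.
def brace(text: str, bos: str = "<s>", eos: str = "</s>") -> str:
    return f"{bos}{text}{eos}"

print(brace("Q: Why a pad token?\nA: Batched fine-tuning needs padding."))
# -> <s>Q: Why a pad token?
#    A: Batched fine-tuning needs padding.</s>
```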
