Questions about the BOS, EOS, and UNK/PAD tokenizer changes

#1 opened by flyingkiwiguy

In the model card you mention:

"The included tokenizer is based on that of the baseline model, however the BOS, EOS, and UNK/PAD tokens are distinctly defined, which was not the case with the baseline"

Can you explain a bit more about the motivation behind this change to the tokenizer? I notice that a lot of llama-cpp fine-tuning uses tokenizers with the `add_eos_token=True` flag set.
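
For context, here's roughly what that flag does on a LLaMA-style tokenizer (a sketch; the checkpoint name is just a placeholder):

```python
# Sketch: effect of add_eos_token=True on a LLaMA-style tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_3b",  # placeholder; use your base model
    add_eos_token=True,               # append the EOS token to every encoding
)
ids = tok("Hello world")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# e.g. ['<s>', '▁Hello', '▁world', '</s>']
```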

StarFish Medical ML/AI Lab org

The choice was mostly for fine-tuning, which requires a pad token. I haven't noticed any issues with text generation, so I haven't changed it.
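
For anyone reproducing this, a rough sketch of defining a distinct pad token before fine-tuning (the names are placeholders, not the exact code used for this model):

```python
# Sketch: give the tokenizer a distinct pad token for batched fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-base-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained("your-base-model")

if tok.pad_token is None:
    tok.add_special_tokens({"pad_token": "<pad>"})  # distinct from EOS/UNK
    model.resize_token_embeddings(len(tok))         # match the grown vocab
```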

The EOS (`</s>`) token will prevent run-on generation, and since I brace my fine-tuning data with `<s>...</s>`, the model is trained to end appropriately.
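
Roughly, the bracing looks like this (assuming the standard LLaMA `<s>`/`</s>` special tokens; adjust if your tokenizer differs):

```python
# Sketch: brace each training example with BOS/EOS so the model
# learns where a completion should end.
def brace(text: str, bos: str = "<s>", eos: str = "</s>") -> str:
    return f"{bos}{text}{eos}"

print(brace("Q: Why a pad token?\nA: Batched fine-tuning needs padding."))
# -> <s>Q: Why a pad token?
#    A: Batched fine-tuning needs padding.</s>
```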
