EOS not tokenized correctly

#9
by Stopwolf - opened

I tried training a base model with the same chat format, and the EOS token isn't tokenized correctly.
Both [28789, 28766, 321, 28730, 416, 28766, 28767] and [32000] decode to <|im_end|>, but when text is tokenized it comes out as the former.

What I noticed is that it's only tokenized as EOS (32000) when there's a space preceding it, but realistically there's always some text right before it.

Examples:
2 × 3 = 6<|im_end|> => [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784, 28789, 28766, 321, 28730, 416, 28766, 28767]
2 × 3 = 6 <|im_end|> => [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784, 32000]
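
To make the mismatch concrete, here is a minimal sketch of a stopgap that patches the token ids after tokenization: it replaces the seven-piece sequence with the single EOS id. The ids are copied from the two examples above; the helper name `merge_split_eos` is made up for illustration. This only rewrites ids that are already tokenized and does not fix the tokenizer itself, so treat it as a possible workaround for preparing finetuning data, not as the proper fix.

```python
# Token ids copied from the examples above.
EOS_ID = 32000
# "<|im_end|>" when it is tokenized as ordinary text pieces.
SPLIT_EOS = [28789, 28766, 321, 28730, 416, 28766, 28767]


def merge_split_eos(ids, split=SPLIT_EOS, eos_id=EOS_ID):
    """Return a copy of ids with every occurrence of split replaced by eos_id."""
    out = []
    i = 0
    n = len(split)
    while i < len(ids):
        if ids[i:i + n] == split:
            # Found the split form: emit the single EOS id and skip the pieces.
            out.append(eos_id)
            i += n
        else:
            out.append(ids[i])
            i += 1
    return out


# First example above: no space before <|im_end|>, so EOS is split into pieces.
bad = [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784] + SPLIT_EOS
# Second example above: space before <|im_end|>, so EOS is the single id 32000.
good = [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784, 32000]

fixed = merge_split_eos(bad)
print(fixed == good)  # True
```

Sequences that already contain the single EOS id pass through unchanged, so the helper could be applied to every tokenized example before building labels.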

Any idea how to fix this for further fine-tuning?
