SP tokenizer missing mode tokens

#9
by keremturgutlu - opened

Simply load the tokenizer and look up the mode tokens:

import sentencepiece as spm

# load the SentencePiece model shipped with the checkpoint
sp_model = spm.SentencePieceProcessor(model_file="spiece.model")

sp_model.piece_to_id("[NLG]")
sp_model.piece_to_id("[S2S]")
sp_model.piece_to_id("[NLU]")

All of these map to <unk>.
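To double-check (a minimal sketch, reusing the sp_model loaded above): piece_to_id returns the unknown-token id for any piece not in the vocab, so the lookups can be compared against unk_id():

# out-of-vocab pieces all resolve to the <unk> id
assert sp_model.piece_to_id("[NLG]") == sp_model.unk_id()
assert sp_model.piece_to_id("[S2S]") == sp_model.unk_id()
assert sp_model.piece_to_id("[NLU]") == sp_model.unk_id()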

linking this here: https://github.com/google-research/google-research/issues/1100

It turns out that these are not special tokens in the vocab; they are tokenized as plain text, e.g. like a prefix prompt. A bit wasteful, I guess :)
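For illustration, a minimal sketch (assuming the same spiece.model as above) showing that a mode token is segmented into ordinary subword pieces rather than mapped to a single reserved id; the exact split depends on the vocab:

# "[NLG]" is not a reserved piece, so it is tokenized like normal text
print(sp_model.encode("[NLG]", out_type=str))
# e.g. something like ['▁[', 'NL', 'G', ']'] -- hypothetical split, varies by model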
