SP tokenizer missing mode tokens

#9
by keremturgutlu - opened

Simply load the tokenizer and look up the mode tokens:

import sentencepiece as spm

# load the SentencePiece model shipped with the checkpoint
sp_model = spm.SentencePieceProcessor(model_file="spiece.model")

sp_model.piece_to_id("[NLG]")
sp_model.piece_to_id("[S2S]")
sp_model.piece_to_id("[NLU]")

All of these map to <unk>.
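To double-check (a minimal sketch, reusing the sp_model loaded above): piece_to_id returns the unknown-token id for any piece not in the vocab, so the lookups can be compared against unk_id():

# out-of-vocab pieces all resolve to the <unk> id
assert sp_model.piece_to_id("[NLG]") == sp_model.unk_id()
assert sp_model.piece_to_id("[S2S]") == sp_model.unk_id()
assert sp_model.piece_to_id("[NLU]") == sp_model.unk_id()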

linking this here: https://github.com/google-research/google-research/issues/1100

It turns out that these are not special tokens in the vocab; they are tokenized as plain text, e.g. like a prefix prompt. A bit wasteful, I guess :)
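For illustration, a minimal sketch (assuming the same spiece.model as above) showing that a mode token is segmented into ordinary subword pieces rather than mapped to a single reserved id; the exact split depends on the vocab:

# "[NLG]" is not a reserved piece, so it is tokenized like normal text
print(sp_model.encode("[NLG]", out_type=str))
# e.g. something like ['▁[', 'NL', 'G', ']'] -- hypothetical split, varies by model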
