how to add special tokens?

#21
by Moses25 - opened

I want to add tokens for other languages on top of this tokenizer. How can I do that?

This tokenizer is a tiktoken tokenizer, not a Hugging Face tokenizer. If you compare model.config.vocab_size with len(tokenizer), you'll see the vocab sizes don't match. I'm still working out what to do about that. It's also why tokenizer.add_special_tokens() doesn't work: the underlying implementation is tiktoken, so the functionality is different.

You can take a look at their tokenization_xgen.py file to see how tokens are added and so on: https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenization_xgen.py
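One more thing to watch: after growing the tokenizer vocab you also need the model's embedding matrix to match, which is exactly the model.config.vocab_size vs len(tokenizer) mismatch mentioned above. In transformers this is done with resize_token_embeddings. A sketch using a tiny GPT-2 stand-in (XGen itself is a different architecture; the small config here is just to keep the example self-contained):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model so the example runs without downloading weights.
config = GPT2Config(vocab_size=100, n_embd=8, n_layer=1, n_head=1, n_positions=16)
model = GPT2LMHeadModel(config)

# Suppose we added 10 new tokens to the tokenizer; grow the embeddings to match.
new_vocab_size = 110
model.resize_token_embeddings(new_vocab_size)

print(model.config.vocab_size)                      # updated to 110
print(model.get_input_embeddings().weight.shape[0]) # embedding rows now 110
```

The new embedding rows are randomly initialized, so the added tokens need fine-tuning before they're useful.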

Moses25 changed discussion status to closed
