Added token

#5
by zokica - opened

Hi,

How did you add these tokens:

"<|im_end|>": 50295,
"<|im_start|>": 50296,
"<|startoftext|>": 50297

Hey @zokica ,

They are added as part of the first training step. The process is to add any new tokens to the tokenizer, noting whether each one is a special token such as eos_token. If the tokenizer size is increased, it's also necessary to save the lm_head and embed_tokens modules, since those are the layers that gain rows for the new tokens.
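As a rough illustration of that step (not necessarily the exact code used for this model), here is a minimal sketch assuming the Hugging Face transformers API. The helper name `add_chat_tokens` and the `"your-base-model"` placeholder are made up for the example; `add_tokens`, `add_special_tokens` and `resize_token_embeddings` are the standard tokenizer and model methods.

```python
from typing import Iterable, Optional


def add_chat_tokens(tokenizer, model, new_tokens: Iterable[str],
                    eos_token: Optional[str] = None) -> int:
    """Hypothetical helper: register new special tokens, then grow the
    model's embeddings so every new token id has a row.

    Returns how many tokens were actually new to the vocabulary.
    """
    # add_tokens skips tokens that already exist and returns the count added.
    num_added = tokenizer.add_tokens(list(new_tokens), special_tokens=True)

    if eos_token is not None:
        # Mark one token as EOS; it is only added if not already in the vocab.
        num_added += tokenizer.add_special_tokens({"eos_token": eos_token})

    if num_added > 0:
        # Resize the input embeddings (and the output head) to the new vocab size.
        model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage sketch (assumes transformers is installed; the model name is a placeholder):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("your-base-model")
# model = AutoModelForCausalLM.from_pretrained("your-base-model")
# add_chat_tokens(tokenizer, model,
#                 ["<|im_end|>", "<|im_start|>", "<|startoftext|>"],
#                 eos_token="<|im_end|>")
```

If you train with LoRA via PEFT, one way to get the resized layers saved is to list them in `modules_to_save` on the `LoraConfig` (for example `["lm_head", "embed_tokens"]`); whether that matches the setup used here is an assumption.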

There is a detailed blog post on this here, and if you're using a framework such as Axolotl, you can state which new tokens you're adding in your config file.
