Added token

by zokica - opened Apr 5

Discussion

zokica

Apr 5

Hi,

How did you add these tokens:

"<|im_end|>": 50295,
"<|im_start|>": 50296,
"<|startoftext|>": 50297

rhysjones

Owner Apr 6

Hey @zokica,

They are added as part of the first training step. The process is to add any new tokens to the tokenizer, noting whether any of them is a special token such as eos_token. If the tokenizer size is increased, it's also necessary to save the lm_head and embed_tokens modules.
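For anyone who wants to see roughly what that looks like in code, here is a minimal sketch using the Hugging Face `transformers` tokenizer and model methods. It is not the exact script used for this model. The helper names `add_chat_tokens` and `example_usage`, the `base_model_path` argument, and the choice of `<|im_end|>` as the eos_token are assumptions made for illustration.

```python
def add_chat_tokens(tokenizer, model, new_tokens, eos_token=None):
    """Add new_tokens to the tokenizer and grow the model embeddings if needed.

    Hypothetical helper: `tokenizer` and `model` are expected to behave like a
    transformers tokenizer and causal LM. Returns the number of tokens added.
    """
    # Register the new tokens as special tokens; tokens already in the
    # vocabulary are skipped, and the call returns how many were added.
    num_added = tokenizer.add_tokens(list(new_tokens), special_tokens=True)

    # Optionally mark one token as the end-of-sequence token.
    if eos_token is not None:
        num_added += tokenizer.add_special_tokens({"eos_token": eos_token})

    # Some models ship with more embedding rows than tokenizer entries, so
    # only resize when the tokenizer has actually outgrown the embeddings.
    current_rows = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > current_rows:
        model.resize_token_embeddings(len(tokenizer))
    return num_added


def example_usage(base_model_path):
    """Assumed usage; base_model_path is a placeholder for your own base model."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoModelForCausalLM.from_pretrained(base_model_path)
    add_chat_tokens(
        tokenizer,
        model,
        ["<|im_end|>", "<|im_start|>", "<|startoftext|>"],
        eos_token="<|im_end|>",  # assumption: ChatML-style end-of-turn token
    )
    return tokenizer, model
```

If you then train with a LoRA adapter through PEFT, passing `modules_to_save=["embed_tokens", "lm_head"]` to `LoraConfig` is one way to keep the updated embedding and output layers alongside the adapter. The module names depend on the model architecture.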

There is a detailed blog post on this here, and if you're using a framework such as Axolotl, you can simply list the new tokens you're adding in your config file.
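As a rough illustration of the Axolotl route, a config fragment along these lines declares the new tokens and asks for the resized layers to be saved with a LoRA adapter. This is a sketch based on Axolotl's documented `tokens`, `special_tokens`, and `lora_modules_to_save` options, not the config used for this model; treating `<|im_end|>` as the eos_token is an assumption.

```yaml
# Assumed fragment of an Axolotl config; merge into your own config file.
tokens:
  - "<|im_start|>"
  - "<|startoftext|>"
special_tokens:
  eos_token: "<|im_end|>"   # assumption: end-of-turn token doubles as EOS
lora_modules_to_save:        # needed when new tokens change the embeddings
  - embed_tokens
  - lm_head
```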
