are there any reserved tokens in sealion?

#2
by tiendung - opened

I'd like to instruct finetune the model using chatml format and need two unused tokens for that.

I found some special tokens in the vocab. Which of these are actually used in pre-training, and therefore should not be repurposed?

0 ""
1 "<|endoftext|>"
2 "<|endofline|>"
3 "<|padding|>"
...
31 "<|en|>"
32 "<|zh|>"
33 "<|id|>"
34 "<|ms|>"
35 "<|tl|>"
36 "<|my|>"
37 "<|th|>"
38 "<|lo|>"
39 "<|km|>"
40 "<|ta|>"
41 "<|vi|>"
42 "<|python|>"
43 "<|javascript|>"
44 "<|shell|>"
45 "<|sql|>"

AI Singapore org

Hi!
Thank you for checking out the model. The following tokens are unused during pretraining:

31 "<|en|>"
32 "<|zh|>"
33 "<|id|>"
34 "<|ms|>"
35 "<|tl|>"
36 "<|my|>"
37 "<|th|>"
38 "<|lo|>"
39 "<|km|>"
40 "<|ta|>"
41 "<|vi|>"

Hope this helps.
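For anyone following along: since the language tokens above are unused, two of them can stand in for the ChatML `<|im_start|>` / `<|im_end|>` delimiters. A minimal sketch of formatting training examples this way (the choice of `<|en|>` and `<|zh|>` here is my own illustrative assumption, not an official recommendation):

```python
# Sketch: repurpose two unused SEA-LION special tokens as ChatML delimiters.
# These assignments are an assumption for illustration only.
IM_START = "<|en|>"  # stands in for <|im_start|>
IM_END = "<|zh|>"    # stands in for <|im_end|>

def chatml(messages):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    return "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )

prompt = chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Because these tokens already exist in the vocab, they tokenize to single IDs and no embedding resize is needed.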

tiendung changed discussion status to closed
