What is going on with your vocab?

#10 by xzuyn - opened
"vocab_size": 128288

"128009": "<|eot_id>"
"128256": "<|eot_id|>"
"128257": "<|reserved_special_token_251|>"
...
"128287": "<|reserved_special_token_281|>"

Why increase the vocab size, making merges more difficult, when you still have 243 reserved tokens you could work with? And why add a second <|eot_id|> token and rename the original to <|eot_id>, when you only use the ChatML format?
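For anyone who wants to reproduce the comparison, here is a minimal sketch using transformers. The two repo IDs are my assumptions (the official Llama 3 base tokenizer versus this fine-tune); substitute whichever checkpoints you are actually comparing:

```python
from transformers import AutoTokenizer

# Repo IDs are assumptions for illustration; swap in the actual models.
base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tuned = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

print(len(base))   # 128256 for the official Llama 3 tokenizer
print(len(tuned))  # 128288 here, per the config above

# List every token the fine-tune added on top of the base vocab.
base_vocab = set(base.get_vocab())
for tok, idx in sorted(tuned.get_vocab().items(), key=lambda kv: kv[1]):
    if tok not in base_vocab:
        print(idx, tok)
```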

@xzuyn
That's actually the official Llama 3's fault. For some reason, it has lots of extra reserved tokens and that <|eot_id|> issue.

It's not, though. The official one has a single <|eot_id|> at ID 128009, but Nous renamed it to <|eot_id> and added a new <|eot_id|> at ID 128256. The official one only has reserved tokens up to <|reserved_special_token_250|> at ID 128255, while Nous adds more, from <|reserved_special_token_251|> at ID 128257 through <|reserved_special_token_281|> at ID 128287. That's why I opened this discussion.
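A quick spot-check of those specific IDs, continuing from the sketch in the opening post (same assumed repo IDs):

```python
# "tuned" is the fine-tune's tokenizer from the sketch above.
for i in (128009, 128255, 128256, 128257, 128287):
    print(i, tuned.convert_ids_to_tokens(i))
# Expected per the config: 128009 -> <|eot_id>, 128256 -> <|eot_id|>,
# 128257 -> <|reserved_special_token_251|>,
# 128287 -> <|reserved_special_token_281|>
```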

NousResearch org

Yea, that was a mistake, but the only mistake.

teknium changed discussion status to closed
