Confirming the EOS token? 32021 or 32014? Or both?

#1
by TheBloke - opened

Hi

I'm having issues with my GGUF quantisations where the model won't stop generating, and generates endless <|EOT|> tokens.

I made the GGUF with special tokens set as per tokenizer_config.json, i.e. EOS is set to token ID 32014.

But in your README I realised you're actually setting it to 32021 for the Instruct models?

# 32021 is the id of <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32021)

So I just wanted to double check that for Instruct, EOS should be set to 32021 and that tokenizer_config.json is wrong in this regard?

Is there a reason that tokenizer_config.json and config.json don't have EOS set to 32021, but rather to 32014? Would you consider changing that, or do other aspects of model generation depend on 32014?

Thanks

TB

For the instruct model, the eos_id is 32021, i.e. the <|EOT|> token. For the base model, the eos_id is 32014, i.e. the <|end▁of▁sentence|> token. We will reset the eos_id for the different models. Thanks for pointing it out.
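
As a stopgap while the configs disagree, a consumer of the model output can stop at either ID. The following is a minimal sketch, not part of the DeepSeek code: the two IDs are the ones quoted in this thread, while `truncate_at_eos` and the sample sequence are hypothetical names and data made up for illustration.

```python
# Token IDs as stated in this thread for DeepSeek Coder.
EOS_BASE = 32014      # EOS for the base model
EOS_INSTRUCT = 32021  # <|EOT|>, EOS for the instruct model


def truncate_at_eos(token_ids, eos_ids):
    """Return token_ids up to (not including) the first ID found in eos_ids."""
    stop = set(eos_ids)
    for i, tok in enumerate(token_ids):
        if tok in stop:
            return list(token_ids[:i])
    # No EOS seen: return the sequence unchanged.
    return list(token_ids)


# Hypothetical output: three ordinary tokens followed by repeated <|EOT|>.
generated = [100, 200, 300, EOS_INSTRUCT, EOS_INSTRUCT, EOS_INSTRUCT]
trimmed = truncate_at_eos(generated, {EOS_BASE, EOS_INSTRUCT})
print(trimmed)
```

Treating both IDs as stop tokens avoids the endless <|EOT|> output regardless of which value the shipped config carries, at the cost of not distinguishing the base and instruct conventions.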

Great, thank you for confirming that quickly.

I will re-make all my Instruct GGUF files once you've been able to update the tokenizer config.

DeepSeek org

I have fixed the mistakes in the instruction models. Thanks!

Thanks very much - but could you do tokenizer_config.json also? Or I can do a PR if you like
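
To see which EOS token a given tokenizer_config.json declares, something like the sketch below may help. It assumes the usual Hugging Face layout, where `eos_token` is either a plain string or a dict with a `content` field; `declared_eos_token` is a hypothetical helper and the sample JSON is made up, not copied from the repository.

```python
import json


def declared_eos_token(tokenizer_config):
    """Return the EOS token string declared in a tokenizer_config dict, or None."""
    eos = tokenizer_config.get("eos_token")
    # Assumed layout: a serialized added token is a dict with the text under "content".
    if isinstance(eos, dict):
        return eos.get("content")
    return eos


# Made-up sample standing in for the contents of a tokenizer_config.json file.
sample = json.loads('{"eos_token": {"__type": "AddedToken", "content": "<|EOT|>"}}')
print(declared_eos_token(sample))
```

For a real file, the dict would come from `json.load` on the opened tokenizer_config.json, and the returned string can then be compared against the token expected for that model variant.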

Chester111 changed discussion status to closed
