
Token inconsistency with Starcoder: fim_ or fim-

#41
by yuryya - opened

This model has special tokens that start with "fim-", while the StarCoder model uses tokens starting with "fim_". The VSCode client works with StarCoder by default, so it uses "fim_" tokens. This breaks SantaCoder when the VSCode endpoint is changed to it: the "fim_..." tokens are parsed as plain text, and the model adds them to its output from time to time.

Workaround: change the token names from "fim_" to "fim-" in the VSCode extension settings when SantaCoder is used.

Proposal: change "fim-" to "fim_" for this model.
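To make the mismatch concrete, here is a minimal sketch of how a client assembles a fill-in-the-middle prompt for each model. The token spellings come from this thread (SantaCoder: "fim-", StarCoder: "fim_"); `build_fim_prompt` and the prefix-suffix-middle ordering are illustrative assumptions, not the extension's actual code.

```python
# FIM token spellings as reported in this thread.
# Note: the only difference is "-" vs "_" inside the token names.
FIM_TOKENS = {
    "santacoder": {"prefix": "<fim-prefix>", "suffix": "<fim-suffix>", "middle": "<fim-middle>"},
    "starcoder": {"prefix": "<fim_prefix>", "suffix": "<fim_suffix>", "middle": "<fim_middle>"},
}


def build_fim_prompt(model: str, prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt in prefix-suffix-middle order (hypothetical helper)."""
    t = FIM_TOKENS[model]
    return f"{t['prefix']}{prefix}{t['suffix']}{suffix}{t['middle']}"
```

If a client configured for StarCoder sends `<fim_prefix>...` to SantaCoder, those strings are not special tokens for its tokenizer, so they get encoded as ordinary text, which is exactly the symptom described above.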

Hello @yuryya , are you certain you have configured the following settings with the right values?
[Screenshot 2023-10-12 at 11.15.42.png: extension settings for the FIM tokens]
If so, please open an issue in https://github.com/huggingface/llm-vscode with the details of your problem.

Hello! Sure, I mentioned that as the "workaround" in my proposal.

The problem is that this fix is not evident. Since StarCoder and SantaCoder are from the same vendor and built for the same task, there is no obvious reason to check the config again. Moreover, a difference like <fim_prefix> vs. <fim-prefix> is very hard for a human to notice, and the error does not manifest every time.

Yes, the problem can be solved by adding a separate template for SantaCoder in https://github.com/huggingface/llm-vscode. That would work for the default configurations, even though the model interfaces would remain different. Still, it is better than nothing; I will create a PR when I have time.
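An alternative to per-model templates would be for the client to detect the FIM token spelling from the tokenizer's own special-token list instead of hardcoding one separator. A minimal sketch of that idea; `find_fim_tokens` is a hypothetical helper, and in practice the input list would come from something like the tokenizer's additional special tokens rather than being hardcoded.

```python
import re


def find_fim_tokens(special_tokens):
    """Map FIM roles (prefix/suffix/middle/pad) to whatever spelling the
    tokenizer actually uses, accepting both 'fim_' and 'fim-' separators."""
    out = {}
    for tok in special_tokens:
        m = re.fullmatch(r"<fim[-_](prefix|suffix|middle|pad)>", tok)
        if m:
            out[m.group(1)] = tok
    return out
```

With this approach the same client code would produce correct prompts for either model, regardless of which separator the tokenizer uses.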

Maybe we can also add a note in the README like: "this model uses different tokens compared to StarCoder (fim- instead of fim_), so be careful when migrating between them".

BigCode org

Hello, both models are by BigCode, but they are not the same family of models; e.g. all StarCoder variants (15B, 7B, 3B...) share the same FIM tokens. But I added the note you suggested to the "How to use FIM" section in the readme: https://huggingface.co/bigcode/santacoder/discussions/42.

Oh man... I am one human who totally missed the _ vs -. I wish they used the same token type.
