Is `added_tokens.json` intended to be here?

#43
by xzuyn - opened

It adds the default special tokens, breaks converting to GGUF, and also isn't included with the Mistral-7B-Instruct model.

Was it accidentally added?

Mistral AI org

Hello @xzuyn ,

I'm sorry you're having trouble using mistralai/Mistral-7B-v0.1.

In the release of transformers v4.34, some breaking changes to the tokenization API were introduced; I suspect your issue is linked to this. What version of transformers are you using?

> What version of transformers are you using?

4.34.0

Is there a reason there's an added_tokens.json here though? Mistral-7B-Instruct-v0.1 doesn't have it. It also seems completely redundant, since it's just adding tokens which already exist, so it's creating a problem for no reason (unless there's something I missed, in which case my mistake).

Deleting the file allowed me to convert the model to GGUF perfectly fine since the other config files already specify the tokens.
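To illustrate the redundancy claim, here is a minimal sketch of checking whether every entry in added_tokens.json already exists in the base vocabulary; the token IDs and vocab slice below are hypothetical, not read from the actual repo files:

```python
import json

# Hypothetical contents of added_tokens.json: the default special tokens.
# (IDs are illustrative, not copied from the repo.)
added_tokens = json.loads('{"<unk>": 0, "<s>": 1, "</s>": 2}')

# Hypothetical slice of the vocab from tokenizer.json.
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "▁the": 272}

# If every "added" token is already in the base vocab with the same ID,
# the file carries no new information.
redundant = all(vocab.get(tok) == idx for tok, idx in added_tokens.items())
print(redundant)  # → True
```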

> Is there a reason there's an added_tokens.json here though? Mistral-7B-Instruct-v0.1 doesn't have it. It also seems completely redundant, since it's just adding tokens which already exist, so it's creating a problem for no reason (unless there's something I missed, in which case my mistake).

This file is there because it is saved by the save_pretrained method in transformers 4.34.0. Unfortunately, I don't know what tokenizer API the transformers maintainers intend going forward given the recent changes; in other words, whether the file should be there or not. @ArthurZ can you guide us on what to do, please? It's unfortunate that a user ran into this issue with transformers 4.34.0.

About Mistral-7B-Instruct-v0.1's tokenizer: we need to change it too, to match the one in this repo. With the recent tokenizer-related changes in transformers, the tokenizer doesn't handle the special tokens as we expected.

The file will have to stay for forward compatibility (people who did not update to 4.34 should still be able to load a new tokenizer with older tokenizer classes, for example), but this one in particular was wrong. A patch will be published soon, and I believe the file should be removed!
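To make the forward-compatibility point concrete, here is a rough sketch of how an older loading path can merge added_tokens.json on top of its base vocabulary; the file contents and vocab below are hypothetical, and the merge logic is a simplification of what pre-4.34 tokenizer classes do, not their actual implementation:

```python
import json
import tempfile
from pathlib import Path

# Simulate a checkpoint directory containing added_tokens.json
# (contents are illustrative, not copied from the repo).
ckpt = Path(tempfile.mkdtemp())
(ckpt / "added_tokens.json").write_text('{"<s>": 1, "</s>": 2}')

# Older loading path: start from the base vocab, then layer
# added_tokens.json on top if it is present.
vocab = {"<unk>": 0, "▁the": 272}
added_file = ckpt / "added_tokens.json"
if added_file.exists():
    vocab.update(json.loads(added_file.read_text()))

print(sorted(vocab))  # vocab now includes the special tokens
```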
