missing tokenizer.model?

#2 opened by b0968

Is the file "tokenizer.model" missing?
Can you add it (also to all the other repos)?

LAION LeoLM org

Oh yeah, I should have caught that. Will add it to this and the other repos :) AFAIK the file is not necessary as long as you do not set use_fast=False.
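For reference, a rough sketch of the difference (assuming the usual transformers API and the repo this discussion lives in):

```python
from transformers import AutoTokenizer

repo = "LeoLM/leo-hessianai-7b-chat"

# The fast (Rust-backed) tokenizer is built from tokenizer.json, so it loads
# fine even when tokenizer.model is absent from the repo.
tok_fast = AutoTokenizer.from_pretrained(repo)

# The slow tokenizer is backed by SentencePiece and needs tokenizer.model;
# this is the call that fails while the file is missing.
tok_slow = AutoTokenizer.from_pretrained(repo, use_fast=False)
```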

I'm trying to use it with llama.cpp, but there seems to be a different tokenizer for the chat version?
Exception: Vocab size mismatch (model has 32128, but ../leo-hessianai-7b-chat/tokenizer.model has 32000).
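A quick way to see the two numbers the converter compares (just a sketch; it assumes the sentencepiece package and a local clone of the chat repo next to llama.cpp):

```python
import json
import sentencepiece as spm

# Compare the SentencePiece vocab against the vocab_size declared in config.json.
sp = spm.SentencePieceProcessor(model_file="../leo-hessianai-7b-chat/tokenizer.model")
print("tokenizer.model vocab:", sp.get_piece_size())  # 32000 in my case

with open("../leo-hessianai-7b-chat/config.json") as f:
    print("config.json vocab_size:", json.load(f)["vocab_size"])  # 32128
```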

It seems I uploaded the wrong tokenizer model. I've now uploaded the correct version to this repo. Could you confirm that it works for you? I'll then update the other repos as well.

Sadly no, it's still the same tokenizer.model:

~/git/leo-hessianai-7b-chat  main  base  15:33:36
❯ md5sum tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  tokenizer.model

~/git/leo-hessianai-7b-chat  main  base  15:33:38
❯ git pull
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), 526 bytes | 175.00 KiB/s, done.
From https://huggingface.co/LeoLM/leo-hessianai-7b-chat
   b96b28f..7c343a5  main       -> origin/main
Updating b96b28f..7c343a5
Fast-forward

~/git/leo-hessianai-7b-chat  main  base  17:29:52
❯ md5sum tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  tokenizer.model
LAION LeoLM org

Not quite sure how to help with this, then. After some testing on my side, the tokenizer.model seems correct (i.e. it has 32006 tokens in the vocab) and also encodes correctly. The model has a vocab size of 32128 after padding to a multiple of 128.
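For what it's worth, the padding arithmetic works out (just a sketch of the numbers above):

```python
import math

sp_vocab = 32006   # tokens actually in tokenizer.model (base vocab plus added special tokens)
pad_to = 128       # the embedding matrix is padded up to a multiple of 128
print(math.ceil(sp_vocab / pad_to) * pad_to)  # 32128, the vocab_size the model reports
```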

Is there any option on your side to use the default Hugging Face tokenizer? Or could you check whether other llama.cpp users have run into similar issues?

Same here. Replacing "32128" with "32000" for the "vocab_size" entry in config.json lets llama.cpp convert the model, but it then fails with "error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected 4096, 32000, got 4096, 32128, 1, 1".

It seems to boil down to a mismatch between the vocabulary dimensions in tokenizer.model, config.json, tokenizer.json, and token_embd.weight; Hugging Face either does not check this or is smart enough to adjust on the fly.
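A rough way to see both numbers from the Hugging Face side (just a sketch; it downloads the full checkpoint and assumes torch is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "LeoLM/leo-hessianai-7b-chat"

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# The tokenizer only defines ~32k ids, while the embedding matrix has extra
# padding rows that are never indexed, which is why transformers runs fine
# while llama.cpp's converter complains about the mismatch.
print("tokenizer vocab:", len(tok))
print("token_embd shape:", tuple(model.get_input_embeddings().weight.shape))
```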
