missing tokenizer.model?

#2 opened by b0968

Is the file "tokenizer.model" missing?
Can you add it (also to all the other repos)?

LAION LeoLM org

Oh yeah, I should have caught that. Will add it to this and the other repos :) AFAIK the file is not necessary as long as you do not set use_fast=False.
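For reference, a rough sketch of the difference (assuming the usual transformers API and the repo this discussion lives in):

```python
from transformers import AutoTokenizer

repo = "LeoLM/leo-hessianai-7b-chat"

# The fast (Rust-backed) tokenizer is built from tokenizer.json, so it loads
# fine even when tokenizer.model is absent from the repo.
tok_fast = AutoTokenizer.from_pretrained(repo)

# The slow tokenizer is backed by SentencePiece and needs tokenizer.model;
# this is the call that fails while the file is missing.
tok_slow = AutoTokenizer.from_pretrained(repo, use_fast=False)
```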

I'm trying to use it with llama.cpp, but there seems to be a different tokenizer for the chat version?
Exception: Vocab size mismatch (model has 32128, but ../leo-hessianai-7b-chat/tokenizer.model has 32000).
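A quick way to see the two numbers the converter compares (just a sketch; it assumes the sentencepiece package and a local clone of the chat repo next to llama.cpp):

```python
import json
import sentencepiece as spm

# Compare the SentencePiece vocab against the vocab_size declared in config.json.
sp = spm.SentencePieceProcessor(model_file="../leo-hessianai-7b-chat/tokenizer.model")
print("tokenizer.model vocab:", sp.get_piece_size())  # 32000 in my case

with open("../leo-hessianai-7b-chat/config.json") as f:
    print("config.json vocab_size:", json.load(f)["vocab_size"])  # 32128
```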

It seems I uploaded the wrong tokenizer model. I've now uploaded the correct version to this repo. Could you confirm that it works for you? I'll then update the other repos as well.

Sadly no, it's still the same tokenizer.model:

~/git/leo-hessianai-7b-chat  main  base  15:33:36
❯ md5sum tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  tokenizer.model

~/git/leo-hessianai-7b-chat  main  base  15:33:38
❯ git pull
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), 526 bytes | 175.00 KiB/s, done.
From https://huggingface.co/LeoLM/leo-hessianai-7b-chat
   b96b28f..7c343a5  main       -> origin/main
Updating b96b28f..7c343a5
Fast-forward

~/git/leo-hessianai-7b-chat  main  base  17:29:52
❯ md5sum tokenizer.model
eeec4125e9c7560836b4873b6f8e3025  tokenizer.model
LAION LeoLM org

Not quite sure how to help with this, then. After some testing on my side, the tokenizer.model seems correct (i.e. it has 32006 tokens in the vocab) and also encodes correctly. The model has a vocab size of 32128 after padding to a multiple of 128.
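For what it's worth, the padding arithmetic works out (just a sketch of the numbers above):

```python
import math

sp_vocab = 32006   # tokens actually in tokenizer.model (base vocab plus added special tokens)
pad_to = 128       # the embedding matrix is padded up to a multiple of 128
print(math.ceil(sp_vocab / pad_to) * pad_to)  # 32128, the vocab_size the model reports
```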

Is there any option on your side to use the default Hugging Face tokenizer? Or could you check whether other llama.cpp users have run into similar issues?

Same here. Replacing "32128" with "32000" for the "vocab_size" entry in config.json lets llama.cpp convert the model, but it then fails with "error loading model: create_tensor: tensor 'token_embd.weight' has wrong shape; expected 4096, 32000, got 4096, 32128, 1, 1".

It seems to boil down to a mismatch between the vocabulary dimensions in tokenizer.model, config.json, tokenizer.json, and token_embd.weight; Hugging Face either does not check this or is smart enough to adjust on the fly.
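A rough way to see both numbers from the Hugging Face side (just a sketch; it downloads the full checkpoint and assumes torch is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "LeoLM/leo-hessianai-7b-chat"

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# The tokenizer only defines ~32k ids, while the embedding matrix has extra
# padding rows that are never indexed, which is why transformers runs fine
# while llama.cpp's converter complains about the mismatch.
print("tokenizer vocab:", len(tok))
print("token_embd shape:", tuple(model.get_input_embeddings().weight.shape))
```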
