from transformers import AutoTokenizer

test_tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-mistral-hessianai-7b-chat")
print(len(test_tokenizer))  # 32002
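
For comparison, here is a minimal sketch of how one could read out the model's embedding size (this loads the full checkpoint, so it needs the corresponding memory):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("LeoLM/leo-mistral-hessianai-7b-chat")
print(model.get_input_embeddings().weight.shape[0])  # 32128
print(model.config.vocab_size)                       # 32128

So the tokenizer (32002) and the embedding matrix (32128) disagree, which is presumably what trips up vLLM.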

This leads, e.g., to the following vLLM error:
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
(see here)

LAION LeoLM org

Have you tested this? The model's weights have an embedding dim of 32128, so I feel like this would break, no?

> Have you tested this? The model's weights have an embedding dim of 32128, so I feel like this would break, no?

No, I didn't test this, and according to the docs you could be right (see here).

Does it work with vLLM for you? See also the example config.json from OpenOrca for comparison. It is probably related to resize_token_embeddings_to_32x (but why is it not 32032 then?).

It also seems to be an issue elsewhere, e.g. here: https://github.com/huggingface/transformers/issues/4875

I have no idea what the right solution is, or whether this is more of a bug in vLLM; it would probably work to resize the token embeddings again after training (model.resize_token_embeddings(embeddings_len)) so that the usable vocab size and the embedding size match, e.g. as sketched below.
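
A minimal sketch of that idea (untested; it assumes the 126 rows above len(tokenizer) are unused padding rows, so dropping them is safe, and the output directory name is just a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "LeoLM/leo-mistral-hessianai-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Shrink the embedding matrix (and the lm_head) from 32128 back to len(tokenizer) == 32002
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("leo-mistral-7b-chat-32002")
tokenizer.save_pretrained("leo-mistral-7b-chat-32002")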

Feel free to close; I just wanted to make you aware of this issue :).

LAION LeoLM org

I think the real solution is to either 1. raise an issue with vLLM and hope they fix it, or 2. add dummy tokens to the tokenizer. I resized the embeddings to a multiple of 128 since this is apparently most efficient on H100+ GPUs. Your idea of resizing back down might also be a good and easy solution; I don't think the speed loss would be too great.
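
(For reference on the numbers: the next multiple of 128 above 32002 is 251 * 128 = 32128, while the next multiple of 32 would have been 32032.) Option 2 could look roughly like the following sketch; the "<dummy_i>" token names and the output directory are made up here:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-mistral-hessianai-7b-chat")

# Pad the vocab with dummy tokens until it matches the 32128 embedding rows (32128 - 32002 = 126)
n_missing = 32128 - len(tokenizer)
tokenizer.add_tokens([f"<dummy_{i}>" for i in range(n_missing)])
assert len(tokenizer) == 32128
tokenizer.save_pretrained("leo-mistral-7b-chat-padded-tokenizer")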

I am trying to convert the model to GGUF, and llama.cpp complains about a vocab size mismatch (model has 32128, but tokenizer.model has 32000); I had removed everything from added_tokens.json. I can, sure, "fix" the vocab_size in the config, but that eventually leads to an error when loading the model: 'token_embd.weight' has wrong shape; expected 4096, 32000, got 4096, 32128.
Any ideas?

