OSError: Unable to load vocabulary from file

#47

by khurramnaseem - opened Apr 9

Apr 9

What is the possible cause of the following error?

File "/Users/kn/mylangchainenv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2327, in _from_pretrained
raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

srowen

Databricks org Apr 9

How are you loading the tokenizer?
Are you sure your copy of the files (either in a local dir, or in the HF cache) is accessible and not corrupt? You could try re-downloading the file if there is no other apparent reason.
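For a quick sanity check on a local copy, something like the sketch below can flag files that are unreadable, empty, or (for .json files) not valid JSON. It is only a rough heuristic, not what transformers does internally. The function name `check_tokenizer_files` is made up for illustration, and the directory you pass in is whatever folder holds your tokenizer files.

```python
import json
from pathlib import Path


def check_tokenizer_files(directory):
    """Return (file name, problem) pairs for files that look unusable.

    Hypothetical helper: it only checks readability, emptiness, and JSON
    syntax. It does not validate the vocabulary contents themselves.
    """
    problems = []
    for path in sorted(Path(directory).iterdir()):
        # Cached files are often symlinks; a dangling one cannot be loaded.
        if path.is_symlink() and not path.exists():
            problems.append((path.name, "broken symlink"))
            continue
        if not path.is_file():
            continue
        try:
            data = path.read_bytes()
        except OSError as exc:
            problems.append((path.name, f"not readable: {exc}"))
            continue
        if not data:
            problems.append((path.name, "empty file"))
        elif path.suffix == ".json":
            try:
                json.loads(data.decode("utf-8"))
            except (UnicodeDecodeError, json.JSONDecodeError) as exc:
                problems.append((path.name, f"invalid JSON: {exc}"))
    return problems
```

An empty result does not prove the files are intact, but a non-empty one is a strong hint that a re-download is worth trying.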

khurramnaseem

Apr 9

Hey Owen,

I have the following test script.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token="")
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="")

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

srowen

Databricks org Apr 9

I presume you specified your token in your actual code, or else you'd have a different error.
Was there any error during the download? What is your HF cache path (that is, what file is it reading), and can you delete that part of the cache and try again?
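If it helps, here is a minimal sketch for finding and clearing the cached copy of one repo. It assumes the usual hub cache layout (a `models--<org>--<name>` folder under the cache root) and only looks at the `HF_HUB_CACHE` and `HF_HOME` environment variables, so it is a simplification of how the library resolves the path. The helper names are made up for this example.

```python
import os
import shutil
from pathlib import Path


def default_hub_cache():
    """Resolve the hub cache root (simplified: HF_HUB_CACHE, then HF_HOME)."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def repo_cache_dir(repo_id, cache_root=None):
    """Folder where a model repo is cached, e.g. models--databricks--dbrx-instruct."""
    root = Path(cache_root) if cache_root is not None else default_hub_cache()
    return root / ("models--" + repo_id.replace("/", "--"))


def remove_cached_repo(repo_id, cache_root=None):
    """Delete the cached copy so the next from_pretrained call downloads it again."""
    target = repo_cache_dir(repo_id, cache_root)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Print the location for this thread's model without deleting anything.
print(repo_cache_dir("databricks/dbrx-instruct"))
```

Calling `remove_cached_repo("databricks/dbrx-instruct")` would then force a fresh download on the next load. Passing `force_download=True` to `from_pretrained` is another way to refetch the files without touching the cache by hand.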

abhi-db

Databricks org Apr 11

Hi @khurramnaseem, we just updated the tokenizer to use the standard GPT2Tokenizer class. Could you try again and let me know if it works?

khurramnaseem

Apr 12 • edited Apr 12

Hey @abhi-db
Yes! It seems to work now. It asked me to run "pip install accelerate", and after I did so it started downloading the following files.
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.3k/29.3k [00:00<00:00, 1.03MB/s]
model-00001-of-00061.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 3.52G/3.52G [10:49<00:00, 5.42MB/s]
model-00002-of-00061.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 4.40G/4.40G [18:16<00:00, 4.02MB/s]
model-00003-of-00061.safetensors:

But it seems to be a lot of data, and I'm not sure what the purpose of all these files is.
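On the "pip install accelerate" prompt: loading with device_map="auto" depends on the accelerate package, so it can be worth checking for it before starting a long download. A minimal sketch, where the helper name `missing_packages` is made up for illustration:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check for accelerate up front instead of finding out at load time.
missing = missing_packages(["accelerate"])
if missing:
    print("Install first: pip install " + " ".join(missing))
```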

eitanturok

Databricks org Apr 14

All of the files you are downloading are simply the model weights. More specifically, the files that end in .safetensors contain the weights. We saved them in 61 different files because our model is "sharded" into pieces. This is normal :)
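To see how the pieces fit together, the small model.safetensors.index.json file downloaded first maps each tensor name to the shard file that stores it. Below is a rough sketch of reading such an index. The tensor names, shard names, and sizes in the example are made up for illustration and are not DBRX's real ones, and `summarize_shards` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_shards(index_text):
    """Count how many tensors each shard file holds, using the index's weight_map."""
    index = json.loads(index_text)
    counts = Counter(index["weight_map"].values())
    total_size = index.get("metadata", {}).get("total_size")
    return dict(sorted(counts.items())), total_size


# Tiny made-up index in the same shape as model.safetensors.index.json.
# The tensor names and sizes here are hypothetical.
example = json.dumps({
    "metadata": {"total_size": 24},
    "weight_map": {
        "layer0.weight": "model-00001-of-00002.safetensors",
        "layer0.bias": "model-00001-of-00002.safetensors",
        "layer1.weight": "model-00002-of-00002.safetensors",
    },
})
shards, total = summarize_shards(example)
print(shards, total)
```

On a real download you would pass in the text of the cached index file instead of the made-up example.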

khurramnaseem

Apr 15

Thank you @eitanturok & @abhi-db for all the help.
