tokenizer.model is missing

#4
by YuxinXiao - opened

Hi, I tried to load the tokenizer of this model, but got the following error: TypeError: not a string.
I think this is because tokenizer.model is missing from this repository.
Could you please check and upload it? Thanks!

upstage org

@YuxinXiao
Hi,
Could you share your transformers version? I'll check tokenizer loading with that version. FYI, our transformers version is '4.31.0'.

upstage org

@YuxinXiao Hello,
I tested with the code below and it loaded the tokenizer successfully.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "upstage/llama-65b-instruct",
    force_download=True
)

I've confirmed that it works on both transformers==4.30.0 and transformers==4.30.1.

Hi, I'm using transformers==4.31.0.
When I run

from transformers import AutoTokenizer

name = 'upstage/llama-65b-instruct'
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False, force_download=True)

I get the following error:

TypeError                                 Traceback (most recent call last)
Cell In[5], line 2
      1 name = 'upstage/llama-65b-instruct'
----> 2 tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False, force_download=True)
      3 # model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True, low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map='auto')

File ~/miniconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:702, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    698     if tokenizer_class is None:
    699         raise ValueError(
    700             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    701         )
--> 702     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    704 # Otherwise we have to be creative.
    705 # if model is an encoder decoder, the encoder tokenizer class is used by default
    706 if isinstance(config, EncoderDecoderConfig):

File ~/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1841, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   1838     else:
   1839         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 1841 return cls._from_pretrained(
   1842     resolved_vocab_files,
   1843     pretrained_model_name_or_path,
   1844     init_configuration,
   1845     *init_inputs,
   1846     use_auth_token=token,
   1847     cache_dir=cache_dir,
   1848     local_files_only=local_files_only,
   1849     _commit_hash=commit_hash,
   1850     _is_local=is_local,
   1851     **kwargs,
   1852 )

File ~/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2004, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2002 # Instantiate tokenizer.
   2003 try:
-> 2004     tokenizer = cls(*init_inputs, **init_kwargs)
   2005 except OSError:
   2006     raise OSError(
   2007         "Unable to load vocabulary from file. "
   2008         "Please check that the provided vocabulary is accessible and not corrupted."
   2009     )

File ~/miniconda3/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py:144, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, legacy, **kwargs)
    142 self.add_eos_token = add_eos_token
    143 self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 144 self.sp_model.Load(vocab_file)

File ~/miniconda3/lib/python3.10/site-packages/sentencepiece/__init__.py:905, in SentencePieceProcessor.Load(self, model_file, model_proto)
    903 if model_proto:
    904   return self.LoadFromSerializedProto(model_proto)
--> 905 return self.LoadFromFile(model_file)

File ~/miniconda3/lib/python3.10/site-packages/sentencepiece/__init__.py:310, in SentencePieceProcessor.LoadFromFile(self, arg)
    309 def LoadFromFile(self, arg):
--> 310     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

TypeError: not a string

In fact, you can find tokenizer.model in the "Files and versions" tab of upstage/llama-30b-instruct, but not here.
So I think the error is caused by the missing tokenizer.model.
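For context, the "TypeError: not a string" message is consistent with a missing file: when tokenizer.model cannot be resolved, transformers passes None as vocab_file down to SentencePiece, whose loader only accepts a path string. Here is a minimal sketch of that failure mode — load_sentencepiece_model is a hypothetical stand-in for illustration, not the real SentencePiece API:

```python
# Hypothetical illustration of why a missing tokenizer.model surfaces as
# "TypeError: not a string" rather than a file-not-found error.

def load_sentencepiece_model(vocab_file):
    # Mimics the type check inside SentencePiece's LoadFromFile:
    # a non-string argument is rejected before the file is even opened.
    if not isinstance(vocab_file, str):
        raise TypeError("not a string")
    return f"loaded {vocab_file}"

# When tokenizer.model is absent from the repo, the resolved vocab_file
# is None, so the slow LlamaTokenizer fails like this:
try:
    load_sentencepiece_model(None)
except TypeError as e:
    print(e)  # prints: not a string
```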

upstage org

@YuxinXiao Thanks a lot.

With use_fast=False, loading does seem to require tokenizer.model, so we have uploaded it.
We confirmed that the tokenizer now loads without any problem.
Could you give it one more try?
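For anyone hitting the same thing: the default fast tokenizer is backed by tokenizer.json, while use_fast=False selects the slow, SentencePiece-based LlamaTokenizer, which reads tokenizer.model. That explains why the earlier test (without use_fast=False) succeeded while this one failed. A toy sketch of the distinction — the file names are the real conventions, but required_file is a simplification, not actual transformers code:

```python
def required_file(use_fast: bool) -> str:
    # Fast tokenizers are backed by tokenizer.json; slow SentencePiece-based
    # tokenizers like LlamaTokenizer need tokenizer.model instead.
    return "tokenizer.json" if use_fast else "tokenizer.model"

print(required_file(True))   # tokenizer.json  (default; loaded fine)
print(required_file(False))  # tokenizer.model (was missing; failed)
```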

wonhosong changed discussion status to closed

Thanks for uploading it! It works fine now.
