Tags: Text Generation · Transformers · PyTorch · Safetensors · Finnish · llama · finnish · text-generation-inference

Tokenizer is broken?

#1
by mpasila - opened

I tried loading it in 4-bit with bitsandbytes and it gives this error:

Traceback (most recent call last):
  File "C:\Users\pasil\text-generation-webui\server.py", line 223, in <module>
    shared.model, shared.tokenizer = load_model(model_name)
                                     ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\text-generation-webui\modules\models.py", line 92, in load_model
    tokenizer = load_tokenizer(model_name, model)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\text-generation-webui\modules\models.py", line 111, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 751, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\transformers\tokenization_utils_base.py", line 2017, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\transformers\tokenization_utils_base.py", line 2249, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\transformers\models\llama\tokenization_llama.py", line 141, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\transformers\models\llama\tokenization_llama.py", line 166, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\sentencepiece\__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pasil\anaconda3\envs\textgen\Lib\site-packages\sentencepiece\__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: not a string
Finnish-NLP org

@mpasila can you try loading the tokenizer with AutoTokenizer from transformers? That should work.
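
For reference, a minimal sketch of that suggestion (the model id below is a placeholder for this repo's actual name, not confirmed from the thread). The traceback shows the slow sentencepiece-based LlamaTokenizer failing because its vocab_file is None (hence "TypeError: not a string"), which typically happens when a repo ships a tokenizer.json but no tokenizer.model; AutoTokenizer can select the fast tokenizer backed by tokenizer.json instead:

from transformers import AutoTokenizer

# "Finnish-NLP/llama-7b-finnish" is a hypothetical model id for illustration.
# use_fast=True (the default) picks the tokenizer.json-backed fast tokenizer,
# avoiding the slow sentencepiece path that raised "not a string" above.
tokenizer = AutoTokenizer.from_pretrained(
    "Finnish-NLP/llama-7b-finnish",
    use_fast=True,
)
print(tokenizer("Hei maailma!"))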

I tried loading it again today and now it loads just fine, so I'm not sure what happened last time. Oobabooga's text-generation-webui already does what you suggested, so that shouldn't have been the problem.
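
For anyone hitting the same thing, a 4-bit bitsandbytes load through plain transformers looks roughly like this (a sketch, not the webui's exact code; the model id is again a hypothetical placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Finnish-NLP/llama-7b-finnish"  # placeholder id for illustration

# 4-bit NF4 quantization via bitsandbytes, comparable to the webui's
# load-in-4bit option.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)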

mpasila changed discussion status to closed
