Tokenizer code not working? `AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'`

#1
by TheBloke - opened

Trying to load your tokenizer using the provided example code gives this error: `AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'`

Tested with: transformers==4.34.1

Full log:

In [1]: from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K", trust_remote_code=True, use_fast=False)
A new version of the following files was downloaded from https://huggingface.co/vivo-ai/BlueLM-7B-Chat-32K:
- tokenization_bluelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K", trust_remote_code=True, use_fast=False)

File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:738, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    736     if os.path.isdir(pretrained_model_name_or_path):
    737         tokenizer_class.register_for_auto_class()
--> 738     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    739 elif config_tokenizer_class is not None:
    740     tokenizer_class = None

File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2017, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
   2014     else:
   2015         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2017 return cls._from_pretrained(
   2018     resolved_vocab_files,
   2019     pretrained_model_name_or_path,
   2020     init_configuration,
   2021     *init_inputs,
   2022     token=token,
   2023     cache_dir=cache_dir,
   2024     local_files_only=local_files_only,
   2025     _commit_hash=commit_hash,
   2026     _is_local=is_local,
   2027     **kwargs,
   2028 )

File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2249, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2247 # Instantiate the tokenizer.
   2248 try:
-> 2249     tokenizer = cls(*init_inputs, **init_kwargs)
   2250 except OSError:
   2251     raise OSError(
   2252         "Unable to load vocabulary from file. "
   2253         "Please check that the provided vocabulary is accessible and not corrupted."
   2254     )

File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:76, in BlueLMTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, **kwargs)
     74 unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
     75 pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
---> 76 super().__init__(
     77     bos_token=bos_token,
     78     eos_token=eos_token,
     79     unk_token=unk_token,
     80     pad_token=pad_token,
     81     add_bos_token=add_bos_token,
     82     add_eos_token=add_eos_token,
     83     sp_model_kwargs=self.sp_model_kwargs,
     84     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
     85     **kwargs,
     86 )
     87 self.vocab_file = vocab_file
     88 self.add_bos_token = add_bos_token

File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py:367, in PreTrainedTokenizer.__init__(self, **kwargs)
    363 super().__init__(**kwargs)
    365 # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
    366 # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
--> 367 self._add_tokens(
    368     [token for token in self.all_special_tokens_extended if token not in self._added_tokens_encoder],
    369     special_tokens=True,
    370 )
    372 self._decode_use_source_tokenizer = False

File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py:467, in PreTrainedTokenizer._add_tokens(self, new_tokens, special_tokens)
    465     return added_tokens
    466 # TODO this is fairly slow to improve!
--> 467 current_vocab = self.get_vocab().copy()
    468 new_idx = len(current_vocab)  # only call this once, len gives the last index + 1
    469 for token in new_tokens:

File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:110, in BlueLMTokenizer.get_vocab(self)
    108 def get_vocab(self):
    109     """Returns vocab as a dict"""
--> 110     vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    111     vocab.update(self.added_tokens_encoder)
    112     return vocab

File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:106, in BlueLMTokenizer.vocab_size(self)
    103 @property
    104 def vocab_size(self):
    105     """Returns vocab size"""
--> 106     return self.sp_model.get_piece_size()

AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'
vivo AI Lab org

You can try transformers==4.33.1.
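Until the repo's tokenizer code is updated, a small guard can catch the incompatibility up front. This is a hypothetical helper (not part of transformers or the BlueLM repo), assuming, per this thread, that 4.33.x works and 4.34.x does not:

```python
def needs_pin(transformers_version: str) -> bool:
    """Return True if this transformers version is known to hit the
    BlueLM sp_model init-order bug (assumption from this thread:
    4.33.x works, 4.34.0 and later fail)."""
    major, minor = (int(p) for p in transformers_version.split(".")[:2])
    return (major, minor) >= (4, 34)


# Example: check the installed version before loading the tokenizer
# import transformers
# if needs_pin(transformers.__version__):
#     raise RuntimeError("pin transformers==4.33.1 for BlueLM")
```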

Done, that worked for me.

Are you planning to fix it for 4.34.1? Otherwise this is very limiting for users; most people want to be on the latest Transformers. And this will only become more important as new Transformers releases come out (there's going to be another Transformers release in the next day or two).

vivo AI Lab org

Moving the call to `super().__init__()` to a line after the creation of `self.sp_model` in tokenization_bluelm.py should resolve the issue.
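The reason the ordering matters: in transformers >= 4.34, `PreTrainedTokenizer.__init__` itself calls `_add_tokens()`, which reaches `get_vocab()` -> `vocab_size` -> `self.sp_model`, so `sp_model` must already exist when the base constructor runs. A minimal self-contained sketch with stub classes (not the real transformers code) shows both orderings:

```python
class _StubSpModel:
    """Stand-in for the SentencePiece processor."""
    def get_piece_size(self):
        return 100


class _Base:
    """Stand-in for PreTrainedTokenizer in transformers >= 4.34:
    its __init__ touches the subclass's vocab via self.vocab_size."""
    def __init__(self, **kwargs):
        _ = self.vocab_size  # mirrors _add_tokens() -> get_vocab()


class BrokenTokenizer(_Base):
    def __init__(self):
        super().__init__()               # fails: sp_model doesn't exist yet
        self.sp_model = _StubSpModel()

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()


class FixedTokenizer(_Base):
    def __init__(self):
        self.sp_model = _StubSpModel()   # created BEFORE super().__init__()
        super().__init__()               # vocab lookup now succeeds

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()
```

`BrokenTokenizer()` raises `AttributeError` just like the traceback above, while `FixedTokenizer()` constructs cleanly.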

JoeyHeisenberg changed discussion status to closed