Tokenizer code not working? `AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'`
#1 opened by TheBloke
Trying to load your tokenizer using the provided example code gives this error: `AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'`
Tested with: transformers==4.34.1
Full log:
In [1]: from transformers import AutoModelForCausalLM, AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K", trust_remote_code=True, use_fast=False)
A new version of the following files was downloaded from https://huggingface.co/vivo-ai/BlueLM-7B-Chat-32K:
- tokenization_bluelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K", trust_remote_code=True, use_fast=False)
File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:738, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
736 if os.path.isdir(pretrained_model_name_or_path):
737 tokenizer_class.register_for_auto_class()
--> 738 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
739 elif config_tokenizer_class is not None:
740 tokenizer_class = None
File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2017, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
2014 else:
2015 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2017 return cls._from_pretrained(
2018 resolved_vocab_files,
2019 pretrained_model_name_or_path,
2020 init_configuration,
2021 *init_inputs,
2022 token=token,
2023 cache_dir=cache_dir,
2024 local_files_only=local_files_only,
2025 _commit_hash=commit_hash,
2026 _is_local=is_local,
2027 **kwargs,
2028 )
File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2249, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2247 # Instantiate the tokenizer.
2248 try:
-> 2249 tokenizer = cls(*init_inputs, **init_kwargs)
2250 except OSError:
2251 raise OSError(
2252 "Unable to load vocabulary from file. "
2253 "Please check that the provided vocabulary is accessible and not corrupted."
2254 )
File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:76, in BlueLMTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, **kwargs)
74 unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
75 pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
---> 76 super().__init__(
77 bos_token=bos_token,
78 eos_token=eos_token,
79 unk_token=unk_token,
80 pad_token=pad_token,
81 add_bos_token=add_bos_token,
82 add_eos_token=add_eos_token,
83 sp_model_kwargs=self.sp_model_kwargs,
84 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
85 **kwargs,
86 )
87 self.vocab_file = vocab_file
88 self.add_bos_token = add_bos_token
File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py:367, in PreTrainedTokenizer.__init__(self, **kwargs)
363 super().__init__(**kwargs)
365 # 4. If some of the special tokens are not part of the vocab, we add them, at the end.
366 # the order of addition is the same as self.SPECIAL_TOKENS_ATTRIBUTES following `tokenizers`
--> 367 self._add_tokens(
368 [token for token in self.all_special_tokens_extended if token not in self._added_tokens_encoder],
369 special_tokens=True,
370 )
372 self._decode_use_source_tokenizer = False
File /workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py:467, in PreTrainedTokenizer._add_tokens(self, new_tokens, special_tokens)
465 return added_tokens
466 # TODO this is fairly slow to improve!
--> 467 current_vocab = self.get_vocab().copy()
468 new_idx = len(current_vocab) # only call this once, len gives the last index + 1
469 for token in new_tokens:
File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:110, in BlueLMTokenizer.get_vocab(self)
108 def get_vocab(self):
109 """Returns vocab as a dict"""
--> 110 vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
111 vocab.update(self.added_tokens_encoder)
112 return vocab
File /workspace/huggingface/modules/transformers_modules/vivo-ai/BlueLM-7B-Chat-32K/1b474dbc96f42f94289eafd42d7a582a436f87ba/tokenization_bluelm.py:106, in BlueLMTokenizer.vocab_size(self)
103 @property
104 def vocab_size(self):
105 """Returns vocab size"""
--> 106 return self.sp_model.get_piece_size()
AttributeError: 'BlueLMTokenizer' object has no attribute 'sp_model'
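As an aside, the download warning in the log suggests pinning a revision so the remote code file isn't silently replaced on each load. A minimal sketch of that, using the commit hash that appears in the traceback paths (any revision you trust would do):

```python
from transformers import AutoTokenizer

# Pin the repo to a specific commit so tokenization_bluelm.py is not
# re-downloaded when the repo changes. The hash below is the one shown in
# the traceback paths above; substitute whichever revision you want.
tokenizer = AutoTokenizer.from_pretrained(
    "vivo-ai/BlueLM-7B-Chat-32K",
    trust_remote_code=True,
    use_fast=False,
    revision="1b474dbc96f42f94289eafd42d7a582a436f87ba",
)
```

Note this only pins the code; it does not avoid the `sp_model` error on transformers 4.34.1.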
You can try transformers==4.33.1.
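For reference, that means downgrading with `pip install "transformers==4.33.1"` and then re-running the original snippet. A quick sanity check, assuming that install:

```python
# Assumes transformers==4.33.1 has been installed (pip install "transformers==4.33.1").
import transformers

assert transformers.__version__ == "4.33.1", transformers.__version__

from transformers import AutoTokenizer

# On 4.33.x the base __init__ does not touch self.sp_model, so this loads cleanly.
tokenizer = AutoTokenizer.from_pretrained(
    "vivo-ai/BlueLM-7B-Chat-32K", trust_remote_code=True, use_fast=False
)
print(tokenizer("hello world").input_ids)  # tokenizes without the AttributeError
```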
Done. I succeeded.
Are you planning to fix it for 4.34.1? Otherwise this is very limiting for users, as most people want to be on the latest Transformers. And this will only become more important as new Transformers releases come out (there's going to be another release in the next day or two).
Moving the call to `super().__init__()` to a line after the creation of `self.sp_model` in tokenization_bluelm.py could resolve the issue.
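A minimal sketch of that reordering, assuming the same structure as the traceback above (a LLaMA-style SentencePiece tokenizer); the default token values and surrounding details here are assumptions, not the upstream file:

```python
import sentencepiece as spm
from transformers.tokenization_utils import PreTrainedTokenizer


class BlueLMTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, unk_token="<unk>", bos_token="<s>",
                 eos_token="</s>", pad_token=None, sp_model_kwargs=None,
                 add_bos_token=True, add_eos_token=False,
                 clean_up_tokenization_spaces=False, **kwargs):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        # Load the SentencePiece model BEFORE calling super().__init__():
        # since transformers 4.34, PreTrainedTokenizer.__init__ calls
        # self._add_tokens -> self.get_vocab() -> self.vocab_size, which
        # reaches self.sp_model before the subclass body has created it.
        self.vocab_file = vocab_file
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            add_bos_token=add_bos_token,
            add_eos_token=add_eos_token,
            sp_model_kwargs=self.sp_model_kwargs,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
        self.add_bos_token = add_bos_token
        self.add_eos_token = add_eos_token
```

This is the same fix applied to other SentencePiece-based tokenizers for 4.34 compatibility, and it stays backward compatible with 4.33.x, where the base `__init__` never touches `sp_model`.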
JoeyHeisenberg changed discussion status to closed