Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

#161
by Naiel - opened

Hi, I started encountering an issue when running the "AutoTokenizer.from_pretrained" function with this checkpoint. I have transformers version 4.34.0 installed; this issue did not occur before. How can I overcome it? Thank you.


Exception Traceback (most recent call last)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_prefix_space=True)

File /python/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:751, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
747 if tokenizer_class is None:
748 raise ValueError(
749 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
750 )
--> 751 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
753 # Otherwise we have to be creative.
754 # if model is an encoder decoder, the encoder tokenizer class is used by default
755 if isinstance(config, EncoderDecoderConfig):

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2045, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
2042 else:
2043 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2045 return cls._from_pretrained(
2046 resolved_vocab_files,
2047 pretrained_model_name_or_path,
2048 init_configuration,
2049 *init_inputs,
2050 token=token,
2051 cache_dir=cache_dir,
2052 local_files_only=local_files_only,
2053 _commit_hash=commit_hash,
2054 _is_local=is_local,
2055 **kwargs,
2056 )

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2256, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2254 # Instantiate the tokenizer.
2255 try:
-> 2256 tokenizer = cls(*init_inputs, **init_kwargs)
2257 except OSError:
2258 raise OSError(
2259 "Unable to load vocabulary from file. "
2260 "Please check that the provided vocabulary is accessible and not corrupted."
2261 )

File /python/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py:122, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, **kwargs)
109 def __init__(
110 self,
111 vocab_file=None,
(...)
120 **kwargs,
121 ):
--> 122 super().__init__(
123 vocab_file=vocab_file,
124 tokenizer_file=tokenizer_file,
125 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
126 unk_token=unk_token,
127 bos_token=bos_token,
128 eos_token=eos_token,
129 use_default_system_prompt=use_default_system_prompt,
130 **kwargs,
131 )
132 self._add_bos_token = add_bos_token
133 self._add_eos_token = add_eos_token

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:111, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
108 fast_tokenizer = copy.deepcopy(tokenizer_object)
109 elif fast_tokenizer_file is not None and not from_slow:
110 # We have a serialization from tokenizers which let us directly build the backend
--> 111 fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
112 elif slow_tokenizer is not None:
113 # We need to convert a slow tokenizer to build the backend
114 fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

Hey, I am seeing the same issue. My model has been working fine for 4 months with no problems, but it started failing to build during the last week with this same error. @Naiel, have you been able to resolve it? Thanks for your post.

Mistral AI org


Be sure to update to the most recent transformers version, since there have been a lot of updates!
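Something like this, as a minimal sketch (the model id is the one from the traceback above; the install command assumes pip):

# upgrade first, in the shell or environment:
#   pip install -U transformers

from transformers import AutoTokenizer

# retry loading the fast tokenizer once transformers (and its bundled
# tokenizers dependency) are up to date
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_prefix_space=True)
print(tokenizer("Hello world").input_ids)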

THANK YOU @pandora-s for this response. It is working now with the recent transformers update. Much appreciated. @Naiel, hope this works for you too.

Hi everyone, thanks for the efforts. I am glad to hear some of you resolved this error. I'm still facing the same error with "AutoTokenizer.from_pretrained" after updating the transformers package with "pip install transformers -U" and "pip install transformers". I use Azure Databricks to run the model. Not sure how to fix this. Thank you.
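One thing worth checking on Databricks (a guess, not something confirmed in this thread): a pip install in a notebook does not take effect until the Python process is restarted, so the old transformers/tokenizers may still be the ones imported. A minimal sketch for a Databricks notebook, where the %pip magic must be the first line of its own cell and the restart call resets the Python state:

%pip install -U transformers

dbutils.library.restartPython()

After the restart, printing transformers.__version__ and tokenizers.__version__ should confirm the upgrade actually took effect before calling AutoTokenizer.from_pretrained again.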

Could you try using transformers>=4.40,<4.42?
This works for me when using the Instruct version.
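For reference, a pinned install matching that range (quoting the specifier so the shell does not interpret the > and < characters):

pip install "transformers>=4.40,<4.42"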
