Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

#161
by Naiel - opened

Hi, I started encountering an issue when running the "AutoTokenizer.from_pretrained" function with this checkpoint. I have transformers version 4.34.0 installed; this issue did not occur before. How can I overcome it? Thank you.


Exception Traceback (most recent call last)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_prefix_space=True)

File /python/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:751, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
747 if tokenizer_class is None:
748 raise ValueError(
749 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
750 )
--> 751 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
753 # Otherwise we have to be creative.
754 # if model is an encoder decoder, the encoder tokenizer class is used by default
755 if isinstance(config, EncoderDecoderConfig):

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2045, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, *init_inputs, **kwargs)
2042 else:
2043 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2045 return cls._from_pretrained(
2046 resolved_vocab_files,
2047 pretrained_model_name_or_path,
2048 init_configuration,
2049 *init_inputs,
2050 token=token,
2051 cache_dir=cache_dir,
2052 local_files_only=local_files_only,
2053 _commit_hash=commit_hash,
2054 _is_local=is_local,
2055 **kwargs,
2056 )

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2256, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
2254 # Instantiate the tokenizer.
2255 try:
-> 2256 tokenizer = cls(*init_inputs, **init_kwargs)
2257 except OSError:
2258 raise OSError(
2259 "Unable to load vocabulary from file. "
2260 "Please check that the provided vocabulary is accessible and not corrupted."
2261 )

File /python/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py:122, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, **kwargs)
109 def __init__(
110 self,
111 vocab_file=None,
(...)
120 **kwargs,
121 ):
--> 122 super().__init__(
123 vocab_file=vocab_file,
124 tokenizer_file=tokenizer_file,
125 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
126 unk_token=unk_token,
127 bos_token=bos_token,
128 eos_token=eos_token,
129 use_default_system_prompt=use_default_system_prompt,
130 **kwargs,
131 )
132 self._add_bos_token = add_bos_token
133 self._add_eos_token = add_eos_token

File /python/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:111, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
108 fast_tokenizer = copy.deepcopy(tokenizer_object)
109 elif fast_tokenizer_file is not None and not from_slow:
110 # We have a serialization from tokenizers which let us directly build the backend
--> 111 fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
112 elif slow_tokenizer is not None:
113 # We need to convert a slow tokenizer to build the backend
114 fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

Hey, I am seeing the same issue. My model has been working fine for 4 months with no problems, but it started failing to build during the last week with this same error. @Naiel, have you been able to resolve it? Thanks for your post.

Mistral AI org


Be sure to update to the most recent transformers version, since there have been a lot of updates!
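Something like this, as a minimal sketch (the model id is the one from the traceback above; the install command assumes pip):

# upgrade first, in the shell or environment:
#   pip install -U transformers

from transformers import AutoTokenizer

# retry loading the fast tokenizer once transformers (and its bundled
# tokenizers dependency) are up to date
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_prefix_space=True)
print(tokenizer("Hello world").input_ids)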

THANK YOU @pandora-s for this response. It is working now with the recent transformers update. Much appreciated. @Naiel, hope this works for you too.

Hi everyone, thanks for the efforts. I am glad to hear some of you resolved this error. I'm still facing the same error with "AutoTokenizer.from_pretrained" after updating the transformers package with "pip install transformers -U" and "pip install transformers". I use Azure Databricks to run the model. Not sure how to fix this. Thank you.
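One thing worth checking on Databricks (a guess, not something confirmed in this thread): a pip install in a notebook does not take effect until the Python process is restarted, so the old transformers/tokenizers may still be the ones imported. A minimal sketch for a Databricks notebook, where the %pip magic must be the first line of its own cell and the restart call resets the Python state:

%pip install -U transformers

dbutils.library.restartPython()

After the restart, printing transformers.__version__ and tokenizers.__version__ should confirm the upgrade actually took effect before calling AutoTokenizer.from_pretrained again.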

Could you try using transformers>=4.40,<4.42?
This works for me when using the Instruct version.
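For reference, a pinned install matching that range (quoting the specifier so the shell does not interpret the > and < characters):

pip install "transformers>=4.40,<4.42"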
