Exception: data did not match any variant of untagged enum PyDecoderWrapper
Hello! Thanks for your efforts!
When I tried to load the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")
I received the following error:
Exception: data did not match any variant of untagged enum PyDecoderWrapper at line 1130 column 3
I also tried downloading the tokenizer and loading it locally, with the same result.
Thanks!
It seems the tokenizer was trained with an older version of tokenizers:
!pip install tokenizers==0.13.4rc2
!wget "https://huggingface.co/HPLT/hplt_bert_base_en/resolve/main/tokenizer.json?download=true" -O tokenizer.json
from tokenizers import Tokenizer
tok = Tokenizer.from_file('./tokenizer.json')
tok.get_vocab()
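To check which version is actually active (a restart of the notebook kernel may be needed after the pip install):
import tokenizers
print(tokenizers.__version__)  # expect 0.13.4rc2 after the pin above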
This works to a first approximation:
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
import requests
import json

lang = "en"
response = requests.get(f"https://huggingface.co/HPLT/hplt_bert_base_{lang}/resolve/main/tokenizer.json?download=true")
tokenizer_json = json.loads(response.content)

# Newer tokenizers replaced the Metaspace option 'add_prefix_space'
# with 'prepend_scheme', so patch the serialized config accordingly.
def patch_metaspace(items):
    for item in items:
        if item['type'] == 'Metaspace' and 'add_prefix_space' in item:
            value = item.pop('add_prefix_space')
            item['prepend_scheme'] = 'always' if value else 'never'

patch_metaspace(tokenizer_json['pre_tokenizer']['pretokenizers'])
patch_metaspace(tokenizer_json['decoder']['decoders'])

tok = Tokenizer.from_str(json.dumps(tokenizer_json))
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok)
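A quick sanity check that the patched tokenizer loads and round-trips (the exact tokens depend on the HPLT vocabulary, so the output here is illustrative only):
ids = tokenizer("say hello")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))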
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")  # fails; use the patched tokenizer from above
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
# Fill each [MASK] position with the model's top-scoring token.
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: "[CLS] It's a beautiful place.[SEP]"
assert tokenizer.decode(output_text[0].tolist()) == "[CLS] It's a beautiful place.[SEP]"
Hi, thank you very much for reporting this issue!
We still need to investigate this further; it looks like there was a breaking change introduced in a recent version of tokenizers. The issue here seems to be with the Metaspace decoder, which is no longer recognized (that is the "data did not match any variant of untagged enum PyDecoderWrapper" part of the error message). I did a quick fix for this English model by replacing the Metaspace decoder with a Replace decoder (they should be equivalent), but it ultimately seems to be caused by a bug in tokenizers, and I will ask the maintainers about it :)
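For reference, a Replace-based stand-in for the Metaspace decoder might look roughly like this (a sketch, not necessarily the exact sequence used in the quick fix; the Strip step is an assumption to mimic the stripped prefix space):
from tokenizers import decoders

# Metaspace decoding maps the "▁" marker back to a space and drops the
# leading prefix space; approximate it with Replace (+ Strip, assumed).
decoder = decoders.Sequence([
    decoders.Replace("▁", " "),
    decoders.Strip(content=" ", left=1, right=0),
])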
Newer versions of tokenizers no longer have the Metaspace option 'add_prefix_space':
add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.
but instead 'prepend_scheme':
prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello. Choices: “always”, “never”, “first”. First means the space is only added on the first token (relevant when special tokens are used or other pre_tokenizer are used).
The correct replacement for 'add_prefix_space'=True is probably 'prepend_scheme'='first', but I accidentally replaced it with 'always' and that also works.
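For illustration, this is how the two options map in the new API (a minimal sketch, assuming tokenizers >= 0.19):
from tokenizers.pre_tokenizers import Metaspace

# old API: Metaspace(replacement="▁", add_prefix_space=True)
# new API: add_prefix_space=True  -> prepend_scheme="always" (or "first")
#          add_prefix_space=False -> prepend_scheme="never"
pre_tok = Metaspace(replacement="▁", prepend_scheme="first")
print(pre_tok.pre_tokenize_str("say hello"))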
Thanks for the feedback! It looks like this PR is what causes the problems: https://github.com/huggingface/tokenizers/pull/1476
The PR introduced a breaking change that alters the behavior of the Metaspace pre-tokenizer, which means that using the new tokenizers can lead to silent bugs. I have therefore reverted my previous "fix" so that loading the model actually fails if you use the most recent versions of the libraries. We recommend using tokenizers <0.19 with the HPLT models.
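For example:
pip install "tokenizers<0.19"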
Thanks! These versions work well:
tokenizers==0.15.2
transformers==4.39.3