Exception: data did not match any variant of untagged enum PyDecoderWrapper
Hello! Thanks for your efforts!
When I tried to load the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")
I received the following error:
Exception: data did not match any variant of untagged enum PyDecoderWrapper at line 1130 column 3
I also tried downloading the tokenizer and loading it locally, with the same result.
Thanks!
It seems the tokenizer was trained with an older version of tokenizers:
!pip install tokenizers==0.13.4rc2
!wget "https://huggingface.co/HPLT/hplt_bert_base_en/resolve/main/tokenizer.json?download=true" -O tokenizer.json
from tokenizers import Tokenizer
tok = Tokenizer.from_file('./tokenizer.json')
tok.get_vocab()
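To check which version is actually active (a restart of the notebook kernel may be needed after the pip install):
import tokenizers
print(tokenizers.__version__)  # expect 0.13.4rc2 after the pin above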
This works to a first approximation:
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
import requests
import json

lang = "en"
response = requests.get(f"https://huggingface.co/HPLT/hplt_bert_base_{lang}/resolve/main/tokenizer.json?download=true")
tokenizer_json = json.loads(response.content)

# Newer tokenizers replaced the Metaspace option 'add_prefix_space'
# with 'prepend_scheme', so patch the serialized config accordingly.
def patch_metaspace(items):
    for item in items:
        if item['type'] == 'Metaspace' and 'add_prefix_space' in item:
            value = item.pop('add_prefix_space')
            item['prepend_scheme'] = 'always' if value else 'never'

patch_metaspace(tokenizer_json['pre_tokenizer']['pretokenizers'])
patch_metaspace(tokenizer_json['decoder']['decoders'])

tok = Tokenizer.from_str(json.dumps(tokenizer_json))
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tok)
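A quick sanity check that the patched tokenizer loads and round-trips (the exact tokens depend on the HPLT vocabulary, so the output here is illustrative only):
ids = tokenizer("say hello")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))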
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")  # fails; use the patched tokenizer from above
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
# Fill each [MASK] position with the model's top-scoring token.
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: "[CLS] It's a beautiful place.[SEP]"
assert tokenizer.decode(output_text[0].tolist()) == "[CLS] It's a beautiful place.[SEP]"
Hi, thank you very much for reporting this issue!
We still need to investigate this further; it looks like there was a breaking change introduced in a recent version of tokenizers. The issue here seems to be with the Metaspace decoder, which is no longer recognized (that is the "data did not match any variant of untagged enum PyDecoderWrapper" part of the error message). I did a quick fix for this English model by replacing the Metaspace decoder with a Replace decoder (they should be equivalent), but it ultimately seems to be caused by a bug in tokenizers, and I will ask the maintainers about it :)
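For reference, a Replace-based stand-in for the Metaspace decoder might look roughly like this (a sketch, not necessarily the exact sequence used in the quick fix; the Strip step is an assumption to mimic the stripped prefix space):
from tokenizers import decoders

# Metaspace decoding maps the "▁" marker back to a space and drops the
# leading prefix space; approximate it with Replace (+ Strip, assumed).
decoder = decoders.Sequence([
    decoders.Replace("▁", " "),
    decoders.Strip(content=" ", left=1, right=0),
])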
Newer versions of tokenizers no longer have the Metaspace option 'add_prefix_space':
add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.
but instead 'prepend_scheme':
prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello. Choices: “always”, “never”, “first”. First means the space is only added on the first token (relevant when special tokens are used or other pre_tokenizer are used).
The correct replacement for 'add_prefix_space'=True is probably 'prepend_scheme'='first', but I accidentally replaced it with 'always' and that also works.
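For illustration, this is how the two options map in the new API (a minimal sketch, assuming tokenizers >= 0.19):
from tokenizers.pre_tokenizers import Metaspace

# old API: Metaspace(replacement="▁", add_prefix_space=True)
# new API: add_prefix_space=True  -> prepend_scheme="always" (or "first")
#          add_prefix_space=False -> prepend_scheme="never"
pre_tok = Metaspace(replacement="▁", prepend_scheme="first")
print(pre_tok.pre_tokenize_str("say hello"))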
Thanks for the feedback! It looks like this PR is what causes the problems: https://github.com/huggingface/tokenizers/pull/1476
The PR introduced a breaking change that alters the behavior of the Metaspace pre-tokenizer, which means that using the new tokenizers can lead to silent bugs. I have therefore reverted my previous "fix" so that loading the model actually fails if you use the most recent versions of the libraries. We recommend using tokenizers <0.19 with the HPLT models.
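For example:
pip install "tokenizers<0.19"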
Thanks! These versions work well:
tokenizers==0.15.2
transformers==4.39.3