problem loading tokenizer

#4
by matatonic - opened

I'm seeing the following issue:

>>> from transformers import AutoTokenizer
>>> model_id = 'echo840/Monkey'
>>> tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 288/288 [00:00<00:00, 2.87MB/s]
tokenization_qwen.py: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 21.3k/21.3k [00:00<00:00, 84.0MB/s]
A new version of the following files was downloaded from https://huggingface.co/echo840/Monkey:
- tokenization_qwen.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
qwen.tiktoken: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2.56M/2.56M [00:00<00:00, 30.2MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 35.0/35.0 [00:00<00:00, 333kB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 829, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "hf_home/modules/transformers_modules/echo840/Monkey/e12c9762d453211a1f3d8f5545b3bbfd70d4d1b7/tokenization_qwen.py", line 114, in __init__
    super().__init__(**kwargs)
  File "/home/mashton/.local/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "hf_home/modules/transformers_modules/echo840/Monkey/e12c9762d453211a1f3d8f5545b3bbfd70d4d1b7/tokenization_qwen.py", line 217, in _add_tokens
    if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
AttributeError: 'QWenTokenizer' object has no attribute 'IMAGE_ST'

It seems that super().__init__(**kwargs) calls _add_tokens() before self.IMAGE_ST has been set.
I tried this with both transformers 4.39.2 and 4.40.0.dev0, with the same result.
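
For anyone else hitting this, here is a minimal, self-contained sketch of the initialization-order problem and the reordering that avoids it. The classes and token strings below are generic stand-ins, not the real transformers or Qwen code:

# Base stands in for PreTrainedTokenizer, whose __init__ (tokenization_utils.py,
# line 367 in the traceback) immediately calls the subclass's _add_tokens().
class Base:
    def __init__(self, **kwargs):
        self._add_tokens(kwargs.get("extra_tokens", []))

class Broken(Base):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)           # _add_tokens() runs here...
        self.IMAGE_ST = ("<img>", "</img>")  # ...before this attribute exists

    def _add_tokens(self, tokens):
        # Reads self.IMAGE_ST, which Broken has not set yet -> AttributeError
        return [t for t in tokens if t not in self.IMAGE_ST]

class Fixed(Base):
    def __init__(self, **kwargs):
        self.IMAGE_ST = ("<img>", "</img>")  # set first, so _add_tokens() can read it
        super().__init__(**kwargs)

    def _add_tokens(self, tokens):
        return [t for t in tokens if t not in self.IMAGE_ST]

Fixed(extra_tokens=["<img>"])   # fine
Broken(extra_tokens=["<img>"])  # AttributeError: 'Broken' object has no attribute 'IMAGE_ST'

Moving the attribute assignment above the super().__init__() call is all the workaround amounts to.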

matatonic changed discussion title from problem loading tokrnizer to problem loading tokenizer
Owner

Hello, you should either use transformers==4.32.0 or refer to this link for a fix: https://huggingface.co/echo840/Monkey-Chat/discussions/1.

Based on the copyright, I don't think I can share my own fix even if I write one. I'm trying to add support for Monkey to another project, but I'm not sure I'm allowed to redistribute a modified (fixed) version of Monkey. Would you consider fixing it yourselves? The workaround seems simple enough and should cause no harm to existing users.
transformers keeps updating, so I can't pin transformers==4.32.0 for my project, and if I can't share a fix... what can others do?

Owner

Thank you! I have resolved the issue. Please give it another try, and let me know if you have any questions.

It's working now, thank you! I've added support for it to my project: https://github.com/matatonic/openedai-vision
Congratulations, it works very well, and thanks again!
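
One last note for anyone depending on this in a project: as the download log above suggests, you can pin the remote code to a reviewed commit so later pushes to the model repo are not pulled in silently. A minimal sketch, with a placeholder revision value (substitute the commit that contains the fix):

from transformers import AutoTokenizer

# Pin the remote-code files to a specific, reviewed commit.
# '<commit-hash>' is a placeholder, not a real revision.
tokenizer = AutoTokenizer.from_pretrained(
    'echo840/Monkey',
    trust_remote_code=True,
    revision='<commit-hash>',
)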

matatonic changed discussion status to closed
Owner

That sounds great! Thank you for your contribution.
