Infinite loop loading tokenizer (specific to 33B-GPTQ repo)

#9 opened by gongy

Hi all, loading the tokenizer leads to an infinite loop with the latest transformers:

 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
   return self._convert_token_to_id_with_added_voc(tokens)
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
   return self.unk_token_id
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
   return self.convert_tokens_to_ids(self.unk_token)
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1030, in unk_token
   return str(self._unk_token)
RecursionError: maximum recursion depth exceeded while getting the str of an object
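Reading the traceback, the cycle is: convert_tokens_to_ids falls back to unk_token_id when a token isn't in the vocab, and unk_token_id resolves itself by calling convert_tokens_to_ids again, so if the unk token resolves to None the lookup never terminates. A minimal standalone sketch of that shape (illustration only, not the actual transformers code):

class BrokenTokenizer:
    # Toy model of the bug, not the real implementation.
    vocab = {}          # pretend the lookup misses for our token
    _unk_token = None   # misconfigured: no unk token defined in the config

    def convert_tokens_to_ids(self, token):
        if token in self.vocab:
            return self.vocab[token]
        return self.unk_token_id  # unknown token: fall back to <unk>

    @property
    def unk_token_id(self):
        # the unk token's id is found by converting it to an id...
        return self.convert_tokens_to_ids(str(self._unk_token))

BrokenTokenizer().convert_tokens_to_ids("<s>")  # raises RecursionError, same cycle as above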

I resolved this by just using the 65B or 7B tokenizer configs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/guanaco-65B-GPTQ",
    use_fast=True,
)

Is there a reason the 33B repo has a specifically different tokenizer?

Oh, interesting. I never noticed that.

The 33B tokenizer came from the model Tim Dettmers merged himself, TimDettmers/guanaco-33b-merged.

He didn't merge the other sizes, so I merged them myself and used the standard Llama base tokenizers for those.

I've compared the 33B and 65B tokenizers and yeah, there are a few differences. For example, the 33B doesn't list the <s> token here:

[screenshot: 33B tokenizer_config.json, no <s> token entry]

Where 65B does:

[screenshot: 65B tokenizer_config.json, <s> token present]
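If anyone wants to reproduce that comparison without the screenshots, a quick sketch using huggingface_hub should do it (the repo ids and filename are the ones mentioned in this thread; now that the 33B file has been replaced, the two configs will likely match):

import json
from huggingface_hub import hf_hub_download

for repo in ("TheBloke/guanaco-33B-GPTQ", "TheBloke/guanaco-65B-GPTQ"):
    path = hf_hub_download(repo, "tokenizer_config.json")
    with open(path) as f:
        cfg = json.load(f)
    # print the special-token entries the two configs disagreed on
    print(repo, "->", cfg.get("bos_token"), cfg.get("unk_token"))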

Given you're getting errors, I have removed the 33B tokenizer and replaced it with the file from my 65B repo. Thanks for the report!

@TheBloke I think the file "tokenizer_config.json" also needs to be updated.

Thanks, fixed
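For anyone landing here later, a quick sanity check that the replaced files load cleanly; assuming the 33B repo now carries the 65B tokenizer files, this should no longer recurse:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/guanaco-33B-GPTQ", use_fast=True)
print(tokenizer.bos_token, tokenizer.unk_token)  # expect "<s> <unk>" with no RecursionError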
