Infinite loop loading tokenizer (specific to 33B-GPTQ repo)

#9 opened by gongy

Hi all, loading the tokenizer leads to an infinite loop with the latest transformers:

 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
   return self._convert_token_to_id_with_added_voc(tokens)
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
   return self.unk_token_id
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
   return self.convert_tokens_to_ids(self.unk_token)
 File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1030, in unk_token
   return str(self._unk_token)
RecursionError: maximum recursion depth exceeded while getting the str of an object
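Reading the traceback, the cycle is: convert_tokens_to_ids falls back to unk_token_id when a token isn't in the vocab, and unk_token_id resolves itself by calling convert_tokens_to_ids again, so if the unk token resolves to None the lookup never terminates. A minimal standalone sketch of that shape (illustration only, not the actual transformers code):

class BrokenTokenizer:
    # Toy model of the bug, not the real implementation.
    vocab = {}          # pretend the lookup misses for our token
    _unk_token = None   # misconfigured: no unk token defined in the config

    def convert_tokens_to_ids(self, token):
        if token in self.vocab:
            return self.vocab[token]
        return self.unk_token_id  # unknown token: fall back to <unk>

    @property
    def unk_token_id(self):
        # the unk token's id is found by converting it to an id...
        return self.convert_tokens_to_ids(str(self._unk_token))

BrokenTokenizer().convert_tokens_to_ids("<s>")  # raises RecursionError, same cycle as above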

I resolved this by just using the 65B or 7B tokenizer configs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/guanaco-65B-GPTQ",
    use_fast=True,
)

Is there a reason the 33B repo has a specifically different tokenizer?

Oh, interesting. I never noticed that.

The 33B tokenizer came from the model Tim Dettmers merged himself, TimDettmers/guanaco-33b-merged.

He didn't merge the other sizes, so I merged them myself and used the standard Llama base tokenizers for those.

I've compared the 33B and 65B tokenizers and yeah, there are a few differences. For example, the 33B doesn't list the <s> token here:

[screenshot: 33B tokenizer_config.json, no <s> token entry]

Where 65B does:

[screenshot: 65B tokenizer_config.json, <s> token present]
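If anyone wants to reproduce that comparison without the screenshots, a quick sketch using huggingface_hub should do it (the repo ids and filename are the ones mentioned in this thread; now that the 33B file has been replaced, the two configs will likely match):

import json
from huggingface_hub import hf_hub_download

for repo in ("TheBloke/guanaco-33B-GPTQ", "TheBloke/guanaco-65B-GPTQ"):
    path = hf_hub_download(repo, "tokenizer_config.json")
    with open(path) as f:
        cfg = json.load(f)
    # print the special-token entries the two configs disagreed on
    print(repo, "->", cfg.get("bos_token"), cfg.get("unk_token"))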

Given you're getting errors, I have removed the 33B tokenizer and replaced it with the file from my 65B repo. Thanks for the report!

@TheBloke I think the file "tokenizer_config.json" also needs to be updated.

Thanks, fixed
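For anyone landing here later, a quick sanity check that the replaced files load cleanly; assuming the 33B repo now carries the 65B tokenizer files, this should no longer recurse:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/guanaco-33B-GPTQ", use_fast=True)
print(tokenizer.bos_token, tokenizer.unk_token)  # expect "<s> <unk>" with no RecursionError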
