Tokenizer output differs from `EleutherAI/pythia-1.3b`

by jon-tow - opened
EleutherAI org

The dict returned by the tokenizer for `EleutherAI/pythia-1.3b-deduped` contains a `token_type_ids` key, unlike the other models in the Pythia suite. Is this intended behavior?

The extra key causes errors in calls such as `generate`, with tracebacks like:

      5 inputs = tokenizer(text, return_tensors="pt")
----> 6 model.generate(**inputs)

2 frames
/usr/local/lib/python3.8/dist-packages/transformers/generation/ in _validate_model_kwargs(self, model_kwargs)
    992         if unused_model_args:
--> 993             raise ValueError(
    994                 f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
    995                 " generate arguments will also show up in this list)"

ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
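
As a stopgap, one possible workaround is to drop the unused key before calling `generate`. This is a minimal sketch, not a confirmed fix: `drop_token_type_ids` is a hypothetical helper name, and the plain dict below is a stand-in for the real tokenizer output, which is assumed to behave like a dict.

```python
def drop_token_type_ids(encoded):
    """Return a plain-dict copy of the tokenizer output without `token_type_ids`."""
    return {key: value for key, value in encoded.items() if key != "token_type_ids"}


# Stand-in for `tokenizer(text, return_tensors="pt")` (assumption: dict-like output).
encoded = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}

cleaned = drop_token_type_ids(encoded)
print(sorted(cleaned))
```

With the real objects this would be used as `model.generate(**drop_token_type_ids(inputs))`. Passing `return_token_type_ids=False` in the tokenizer call should have the same effect without a helper.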


jon-tow changed discussion title from Tokenizer output differs from `EleutherAI/pythia-1.3b-deduped` to Tokenizer output differs from `EleutherAI/pythia-1.3b`
EleutherAI org

This is surprising to me. @hails , are you aware of this?

EleutherAI org

That's very strange, I don't know what's up with this! Looking at it, I see that this model's special_tokens_map.json file hasn't been filled out, so I must not have pushed the GPT-NeoX-20b tokenizer. Instead I pushed the NeoX tokenizer from the JSON file we have internally, which for some reason doesn't keep information like special tokens when you save it as a PretrainedTokenizer. I'll push the additional files when I add the rest of the checkpoints tomorrow!

There have been a couple of weird behaviors when saving and uploading the tokenizer from JSON files or from HF. merges.txt also wasn't added to any of these repos, though it exists for NeoX-20b's tokenizer.

If the pythia-1.3b model doesn't give this error, then the above will fix it!
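
One way to check this is to compare the keys each tokenizer returns. This is a rough sketch: `extra_output_keys` is a hypothetical helper, and the two dicts are stand-ins for the real tokenizer outputs rather than values taken from the actual checkpoints.

```python
def extra_output_keys(reference, candidate):
    """Return keys present in `candidate` but missing from `reference`, sorted."""
    return sorted(set(candidate.keys()) - set(reference.keys()))


# Stand-ins for `tokenizer("hello")` from each checkpoint (assumed shapes).
out_1_3b = {"input_ids": [1, 2], "attention_mask": [1, 1]}
out_1_3b_deduped = {
    "input_ids": [1, 2],
    "attention_mask": [1, 1],
    "token_type_ids": [0, 0],
}

extra = extra_output_keys(out_1_3b, out_1_3b_deduped)
print(extra)  # ['token_type_ids']
```

With the real tokenizers, each output would come from loading the checkpoint with `AutoTokenizer.from_pretrained(...)` and calling it on the same text. An empty list would mean the two tokenizers return the same set of keys.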

hails changed discussion status to closed
