Adds the tokenizer configuration file

by lysandre HF Staff - opened Feb 16, 2024

base: refs/heads/main ← from: refs/pr/7

lysandre

OpenAI community org Feb 16, 2024

The tokenizer configuration file is missing or incorrect, which has been causing unexpected errors since the canonical models were migrated.

Refer to the following issue for more information: transformers#29050

The current failing code is the following:

>>> from transformers import AutoTokenizer
>>> previous_tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
>>> current_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
>>> print(previous_tokenizer.model_max_length, current_tokenizer.model_max_length)
1000000000000000019884624838656, 1024
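For context on the very large number in that output: it is exactly what `int(1e30)` evaluates to in Python, and, as far as I can tell, that is the fallback transformers uses when no maximum length is configured for a tokenizer. The sketch below only illustrates the arithmetic and one way to detect the fallback; `UNSET_SENTINEL` and `has_real_max_length` are hypothetical names, not part of the library.

```python
# Value printed above for the tokenizer with no configured maximum length.
OBSERVED = 1000000000000000019884624838656

# Assumption: the library falls back to int(1e30) when no limit is configured.
UNSET_SENTINEL = int(1e30)


def has_real_max_length(model_max_length: int, threshold: int = UNSET_SENTINEL) -> bool:
    """Return True when the value looks like a configured limit, not the fallback."""
    return model_max_length < threshold


print(OBSERVED == UNSET_SENTINEL)      # the observed value matches the float-to-int conversion
print(has_real_max_length(1024))       # a configured limit
print(has_real_max_length(OBSERVED))   # the fallback value
```

A check like this can help callers that truncate to `model_max_length` avoid silently treating the fallback as a usable limit.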

This is the result after the fix:

>>> from transformers import AutoTokenizer
>>> previous_tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
>>> current_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
>>> print(previous_tokenizer.model_max_length, current_tokenizer.model_max_length)
1024, 1024
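As a rough illustration of the kind of fix involved, the sketch below writes and reads back a minimal `tokenizer_config.json` that sets `model_max_length`. The contents shown are an assumption for illustration only; the file actually added in this PR may contain other or additional keys.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical minimal contents; the file added by this PR may hold more keys.
config = {"model_max_length": 1024}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer_config.json"
    # Write the configuration as the repository would store it.
    path.write_text(json.dumps(config, indent=2))
    # Read it back the way a loader would before building the tokenizer.
    loaded = json.loads(path.read_text())

print(loaded["model_max_length"])
```

Storing the limit in the repository's own configuration file means it no longer depends on the model identifier used to load the tokenizer.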

Adds tokenizer_config.json file (aa1b6cec)

lysandre changed pull request status to open Feb 19, 2024

lysandre changed pull request status to merged Feb 19, 2024
