Tiktoken and interaction with Transformers

Support for tiktoken model files is seamlessly integrated in 🤗 transformers: when you call from_pretrained on a Hub repository that contains a tokenizer.model file in tiktoken format, the file is automatically converted into our fast tokenizer.

Known models that were released with a tiktoken.model:

gpt2
llama3

Example usage

To load tiktoken files in transformers, ensure that the tokenizer.model file is a tiktoken file; from_pretrained will then load it automatically. Here is how one would load a tokenizer from such a file, in this case the one stored in the original subfolder of the repository:

from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="original")

Create a tiktoken tokenizer

The tokenizer.model file contains no information about additional tokens or pattern strings. If these are important, convert the tokenizer to tokenizer.json, the appropriate format for PreTrainedTokenizerFast.
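To illustrate why the file carries nothing beyond the vocabulary, here is a minimal sketch of how a tiktoken-format file can be read. It assumes the plain-text layout used by tiktoken's BPE files, where each line holds a base64-encoded token and its integer rank. The helper name parse_tiktoken_ranks and the three-line sample are hypothetical and for illustration only; they are not part of transformers or tiktoken.

```python
import base64


def parse_tiktoken_ranks(text: str) -> dict:
    """Parse tiktoken-style BPE text: one '<base64 token> <rank>' pair per line.

    Hypothetical helper for illustration; not an API of transformers or tiktoken.
    """
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Tiny hand-made sample (assumption: not taken from any real vocabulary).
sample = "aA== 0\naQ== 1\naGk= 2\n"
ranks = parse_tiktoken_ranks(sample)
print(ranks)
```

Each entry maps raw token bytes to a rank. There is no field for special tokens or for the pre-tokenization pattern string, so these have to be supplied from elsewhere.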

Generate the tokenizer.model file with tiktoken.get_encoding and then convert it to tokenizer.json with convert_tiktoken_to_fast.


from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import get_encoding

# You can load your custom encoding or the one provided by OpenAI
encoding = get_encoding("gpt2")
convert_tiktoken_to_fast(encoding, "config/save/dir")

The resulting tokenizer.json file is saved to the specified directory and can be loaded with PreTrainedTokenizerFast.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("config/save/dir")
