no tokenizer.json

#2
by Eruuu - opened

no tokenizer.json

Owner

There is indeed no tokenizer.json in this repo. There is no tokenizer.json in the repo for the base model I tuned this off of either. While I haven't personally used this unquantized version for inference, I know for a fact it is (or at the very least, was) possible to quantize it.

Which stack are you using that insists such a file is required?

I don't have any plans to quantize or finetune the model yet. But there's a platform where I want to host the model, and it requires tokenizer.json in order to run it; that's why I need the file.

Thank you!

Could you add it, by any chance?

Since it's missing from the base model as well, it will take some effort to track down the proper file. I'll try to get this done during the coming week, though.

I've been through this recently when finetuning WizardLM.

The tokenizer.json is missing because the base model uses the slow tokenizer, which ships as three separate files (tokenizer.model, tokenizer_config.json, and special_tokens_map.json) instead of a single tokenizer.json.

You can build the fast_tokenizer for your inference engine like this:

from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
import os

# Load the slow (SentencePiece-based) tokenizer
BASE_MODEL = "rAIfle/SorcererLM-8x22b-bf16"
slow_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)

# Convert to a fast tokenizer; this returns a tokenizers.Tokenizer object,
# which is why we call .save() below rather than .save_pretrained()
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

# Create output directories if they don't exist
os.makedirs("fast_tokenizer", exist_ok=True)
os.makedirs("slow_tokenizer", exist_ok=True)

# Save the fast tokenizer as a single tokenizer.json
fast_tokenizer.save("fast_tokenizer/tokenizer.json")

# You can also save the slow tokenizer's files
# (tokenizer.model, tokenizer_config.json, special_tokens_map.json)
slow_tokenizer.save_pretrained("slow_tokenizer")

However, I don't recommend uploading it to the repo, since a lot of tools expect WizardLM2 MoE-based models to use the slow tokenizer, which behaves differently (padding and special tokens are handled differently, an exl2 quant would fail if tokenizer.json were in the repo, etc.).
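For context: once a tokenizer.json is present in a repo, AutoTokenizer.from_pretrained() loads the fast tokenizer by default, silently changing behavior for anyone who wasn't passing use_fast explicitly. A minimal sketch of how a downstream consumer can pin the slow tokenizer regardless:

from transformers import AutoTokenizer

REPO = "rAIfle/SorcererLM-8x22b-bf16"

# use_fast=False forces the slow (SentencePiece) tokenizer even if a
# tokenizer.json exists in the repo; without it, transformers prefers
# the fast tokenizer whenever tokenizer.json is available
tokenizer = AutoTokenizer.from_pretrained(REPO, use_fast=False)
print(type(tokenizer).__name__)  # a slow class, e.g. LlamaTokenizer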

Test:

test_text = "The quick brown fox jumps over the lazy dog."

# Test slow tokenizer
slow_tokens = slow_tokenizer.encode(test_text)
print("Slow tokenizer:")
print("- IDs:", slow_tokens[:10])
print("- Tokens:", slow_tokenizer.convert_ids_to_tokens(slow_tokens[:10]))
print("- Decoded:", slow_tokenizer.decode(slow_tokens))

# Test fast tokenizer (tokenizers.Tokenizer API)
fast_encoding = fast_tokenizer.encode(test_text)
print("\nFast tokenizer:")
print("- IDs:", fast_encoding.ids[:10])
# Note: decode_batch returns decoded text, not token strings;
# fast_encoding.tokens[:10] would list the individual tokens
print("- Tokens:", fast_tokenizer.decode_batch([fast_encoding.ids[:10]]))
print("- Decoded:", fast_tokenizer.decode(fast_encoding.ids))

Output:

Slow tokenizer:
- IDs: [1, 415, 2936, 9060, 285, 1142, 461, 10575, 754, 272]
- Tokens: ['<s>', '▁The', '▁quick', '▁brown', '▁f', 'ox', '▁j', 'umps', '▁over', '▁the']
- Decoded: <s>The quick brown fox jumps over the lazy dog.

Fast tokenizer:
- IDs: [415, 2936, 9060, 285, 1142, 461, 10575, 754, 272, 17898]
- Tokens: ['The quick brown fox jumps over the lazy']
- Decoded: The quick brown fox jumps over the lazy dog.
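Note the difference visible above: the slow tokenizer prepends the <s> BOS token (ID 1), while the raw fast tokenizer does not. A quick sanity check (using the tokenizers from the script above) that the two otherwise agree:

# The fast IDs should match the slow IDs once special tokens are excluded
slow_no_special = slow_tokenizer.encode(test_text, add_special_tokens=False)
assert slow_no_special == fast_encoding.ids
print("Token IDs match once special tokens are excluded")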

@Eruuu Here's the fast_tokenizer version I created using the above python code:

https://huggingface.co/gghfez/SorcererLM-8x22b-fast_tokenizer/blob/main/tokenizer.json
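If your platform can't pull it from a repo directly, here's a minimal sketch of loading the file with transformers (assuming tokenizer.json has been downloaded locally; the special tokens here are my assumption, based on the usual Mistral-style <s>/</s>/<unk>):

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer.json in a transformers-compatible tokenizer;
# the bos/eos/unk tokens below are assumed, not read from the repo
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
)
print(tokenizer.encode("The quick brown fox jumps over the lazy dog."))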

@gghfez Thanks for the assist, and the clear explanation!

Thank you so much @gghfez!

rAIfle - no problem, thanks for the model! I've wanted something like this for a while but don't have the compute to train it.

Eruuu - no worries, happy to help.

rAIfle changed discussion status to closed
