Broken tokenizer.json file? 17.2mb on Llama 3.3 vs 9mb on Llama 3.1?

#16
by OwenArli - opened

It is a known issue that recent transformers versions often break tokenizer.json, filling it with garbage and doubling its size. Finetuned models often have this issue, and the fix was to copy the original model's tokenizer over the doubled one.

Since 3.3 is essentially a further instruct finetune with the same architecture and tokens, does that mean the 17.2mb tokenizer is a broken tokenizer just like the ones seen on open source finetunes?

Would this affect the model's output? I ask because this kind of broken tokenizer, caused by transformers, made my finetunes much more repetitive at longer contexts. I assume a temporary fix is just copying over 3.1's tokenizer file? It does seem to work, but I have not tested extensively whether it actually improves things.

Llama 3.1 70B Instruct tokenizer file and size: tokenizer.json, 9 MB (image.png)

Llama 3.3 70B Instruct tokenizer file and size: tokenizer.json, 17.2 MB (image.png)

Converting Llama-3.3-70B-Instruct's tokenizer.json to the backwards-compatible format yields a file identical to the tokenizer.json from Llama-3.1-70B-Instruct :) so yes, replacing it should not affect model quality

here's the conversion/validation script if you're curious:

import json
from huggingface_hub import hf_hub_download

LLAMA_3_3 = "Llama-3.3-70B-Instruct"
LLAMA_3_1 = "Llama-3.1-70B-Instruct"

TOKENIZER_JSON = "tokenizer.json"

# Note: the meta-llama repos are gated, so you need to be logged in with access granted.
# Download tokenizer.json for Llama-3.3 (saved to ./Llama-3.3-70B-Instruct/tokenizer.json)
hf_hub_download(repo_id=f"meta-llama/{LLAMA_3_3}", filename=TOKENIZER_JSON, local_dir=LLAMA_3_3, repo_type="model")

# Download tokenizer.json for Llama-3.1 to compare (saved to ./Llama-3.1-70B-Instruct/tokenizer.json)
hf_hub_download(repo_id=f"meta-llama/{LLAMA_3_1}", filename=TOKENIZER_JSON, local_dir=LLAMA_3_1, repo_type="model")

# Load both tokenizer.json files (explicit UTF-8: the vocab contains non-ASCII characters)
with open(f"{LLAMA_3_3}/{TOKENIZER_JSON}", encoding="utf-8") as f:
    tokenizer_llama_3_3 = json.load(f)

with open(f"{LLAMA_3_1}/{TOKENIZER_JSON}", encoding="utf-8") as f:
    tokenizer_llama_3_1 = json.load(f)

# Check if tokenizer.json is different
if tokenizer_llama_3_3 != tokenizer_llama_3_1:
    print("Llama-3.3's tokenizer.json is different from Llama-3.1's tokenizer.json, running conversion")
else:
    raise Exception("Llama-3.3's tokenizer.json is the same as Llama-3.1's tokenizer.json, no need to convert")


# Convert merges to legacy format
legacy_merges = [" ".join(i) for i in tokenizer_llama_3_3["model"]["merges"]]
# replace merges
tokenizer_llama_3_3["model"]["merges"] = legacy_merges


# Compare against Llama-3.1's tokenizer
if tokenizer_llama_3_3 == tokenizer_llama_3_1:
    print("Conversion successful. New tokenizer.json is the same as Llama-3.1's tokenizer.json")
else:
    raise Exception("Conversion failed. New tokenizer.json is different from Llama-3.1's tokenizer.json")

# Save the converted tokenizer.json (explicit UTF-8, since ensure_ascii=False writes raw non-ASCII)
with open(f"{LLAMA_3_3}/{TOKENIZER_JSON}", "w", encoding="utf-8") as f:
    json.dump(tokenizer_llama_3_3, f, indent=2, ensure_ascii=False)
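
A slightly more defensive take on the merge conversion step, as a rough sketch rather than anything official. The one-liner above assumes every merge is a two-element list; if you ran it on a file that is already in the legacy "left right" string format, `" ".join` would split each string into characters. The helper name `merges_to_legacy` is made up for illustration, and the space check rests on my assumption that the legacy format cannot represent tokens containing spaces.

```python
def merges_to_legacy(merges):
    """Return BPE merges as legacy "left right" strings.

    Accepts either format seen in tokenizer.json:
      - legacy: ["a b", "ab c", ...]
      - newer:  [["a", "b"], ["ab", "c"], ...]
    """
    legacy = []
    for merge in merges:
        if isinstance(merge, str):
            # Already legacy format, keep as-is.
            legacy.append(merge)
            continue
        if len(merge) != 2:
            raise ValueError(f"Expected a pair of tokens, got: {merge!r}")
        left, right = merge
        # Assumption: a token with a space would be ambiguous once joined.
        if " " in left or " " in right:
            raise ValueError(f"Cannot express merge in legacy format: {merge!r}")
        legacy.append(f"{left} {right}")
    return legacy


# Toy data, not real Llama merges.
print(merges_to_legacy([["a", "b"], ["ab", "c"]]))
print(merges_to_legacy(["a b", "ab c"]))
```

In the script above this would replace the list comprehension, e.g. `tokenizer_llama_3_3["model"]["merges"] = merges_to_legacy(tokenizer_llama_3_3["model"]["merges"])`.
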
Meta Llama org

Hey @orangetin @OwenArli - Thanks for the discussion, this is due to the pretty-printing in tokenizers: https://github.com/huggingface/transformers/issues/34744#issuecomment-2511167340

Functionally it should be the same. We'll fix the size in the upcoming release of tokenizers.
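
To make the size difference concrete, here is a small sketch using only Python's `json` module on a made-up merge table. It assumes the situation described in this thread: the same merges stored either as "left right" strings or as two-element lists, then pretty-printed. Python's `indent=2` is only a stand-in for the tokenizers serializer, so the exact ratio will differ from the real files.

```python
import json

# Toy merge table (hypothetical data, not the real Llama merges).
pairs = [["a", "b"], ["ab", "c"], ["abc", "d"]] * 1000

# Same merges in the legacy "left right" string format.
legacy = [" ".join(pair) for pair in pairs]


def pretty_size(obj) -> int:
    """Size in bytes of obj when pretty-printed as UTF-8 JSON."""
    return len(json.dumps(obj, indent=2, ensure_ascii=False).encode("utf-8"))


size_legacy = pretty_size({"merges": legacy})
size_pairs = pretty_size({"merges": pairs})

# Each pair is spread over several indented lines, so the pair format is larger
# even though it carries the same information.
print(f"legacy strings: {size_legacy} bytes")
print(f"nested pairs:   {size_pairs} bytes")
print(f"ratio:          {size_pairs / size_legacy:.2f}x")
```

The extra bytes are brackets, quotes, newlines and indentation, which is consistent with the two files comparing equal once the merges are converted back to strings.
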

Ah I see. It might be placebo then? Since even our users say they see more repetitions with the broken tokenizer haha. Not sure.
