Tokenizer?

#4 opened by Reza2kn

Thanks for your work! I'm a bit confused: the tokenizer.model file uploaded here looks the same as the Llama 2 tokenizer, NOT the one described in the paper with +10,000 added Persian tokens. I've verified that the two files' contents are identical. The other .json files related to the added or special tokens are also very short. I'm just looking for your new tokenizer and would appreciate any help!
Thanks!
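
For anyone who wants to reproduce the comparison, here is a minimal sketch for checking whether two tokenizer.model files are byte-identical; the paths are hypothetical placeholders for wherever the two files were downloaded:

```python
import hashlib

def sha256sum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical local paths; point these at the two downloaded files.
llama2_digest = sha256sum("llama2/tokenizer.model")
repo_digest = sha256sum("this-repo/tokenizer.model")
print("identical" if llama2_digest == repo_digest else "different")
```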

University of Tehran org

Thank you! I compared our tokenizer.model file with the actual tokenizer.model file of Llama 2, and they are indeed different: our tokenizer.model contains 89,449 lines, while Llama 2's has 70,285 lines.
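
Since tokenizer.model is a binary SentencePiece protobuf, loading it and reading the vocabulary size is a more direct check than counting lines in the raw file. A minimal sketch, assuming both files are standard SentencePiece models (the paths are hypothetical placeholders):

```python
import sentencepiece as spm

# Hypothetical paths; substitute the actual locations of the two files.
for name, path in [("Llama 2", "llama2/tokenizer.model"),
                   ("this repo", "this-repo/tokenizer.model")]:
    sp = spm.SentencePieceProcessor(model_file=path)
    print(f"{name}: vocab size = {sp.vocab_size()}")
```

If the uploaded file really contains the added Persian tokens, its reported vocabulary size should be larger than Llama 2's 32,000-token vocabulary.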
