Tokenizer?

#4 opened by Reza2kn

Thanks for your work! I'm a bit confused: the tokenizer.model file uploaded here looks the same as the Llama 2 tokenizer, NOT the one described in the paper with +10,000 added Persian tokens. I've verified that the two files' contents are identical. The other .json files related to the added or special tokens are also very short. I'm just looking for your new tokenizer and would appreciate any help!
Thanks!
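
For anyone who wants to reproduce the comparison, here is a minimal sketch for checking whether two tokenizer.model files are byte-identical; the paths are hypothetical placeholders for wherever the two files were downloaded:

```python
import hashlib

def sha256sum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical local paths; point these at the two downloaded files.
llama2_digest = sha256sum("llama2/tokenizer.model")
repo_digest = sha256sum("this-repo/tokenizer.model")
print("identical" if llama2_digest == repo_digest else "different")
```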

University of Tehran org

Thank you! I compared our tokenizer.model file with the actual tokenizer.model file of Llama 2, and they are indeed different: our tokenizer.model contains 89,449 lines, while Llama 2's has 70,285 lines.
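
Since tokenizer.model is a binary SentencePiece protobuf, loading it and reading the vocabulary size is a more direct check than counting lines in the raw file. A minimal sketch, assuming both files are standard SentencePiece models (the paths are hypothetical placeholders):

```python
import sentencepiece as spm

# Hypothetical paths; substitute the actual locations of the two files.
for name, path in [("Llama 2", "llama2/tokenizer.model"),
                   ("this repo", "this-repo/tokenizer.model")]:
    sp = spm.SentencePieceProcessor(model_file=path)
    print(f"{name}: vocab size = {sp.vocab_size()}")
```

If the uploaded file really contains the added Persian tokens, its reported vocabulary size should be larger than Llama 2's 32,000-token vocabulary.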
