Why the SentencePiece tokenizer?

by mnwato - opened

I need to handle mixed-language text (Persian, some English words, numbers, and other characters). In your Medium post you said that you used the SentencePiece tokenizer for this model. Is there any reason for this decision? Why didn't you choose BPE?

bolbolzaban org

Please see the blog post for more details: https://khashei.medium.com/a-not-so-dangerous-ai-in-the-persian-language-39172a641c84
Also, feel free to contact me on Telegram if you have more questions.

khashei changed discussion status to closed
