Encoding issue

#2
by bakrianoo - opened

SA
Can you check why the output of the tokenizers is displayed like the one below

image.png

@bakrianoo
Check Better tokenization for Arabic Text
This problem arises with BPE tokenizers because they default to the bytes of each character and those get encoded to the weird latin characters you are seeing.

💡@MohamedRashad, why don't you make Better tokenization for Arabic Text checked by default?

@SaiedAlshahrani
Because this way sometimes introduce errors when tokenizers other than BPE are used, so i defaulted to not checking it as BPE should be the special case not the general one.

Makes sense! Great work, anyway :-)

@SaiedAlshahrani
Thank You

MohamedRashad changed discussion status to closed

Sign up or log in to comment