Encoding issue
#2
by
bakrianoo
- opened
@bakrianoo
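Check "Better tokenization for Arabic Text".
This problem arises with BPE tokenizers (the byte-level kind) because they fall back to the UTF-8 bytes of each character, and each of those bytes is then displayed as a printable stand-in character. Those stand-ins are the weird Latin characters you are seeing.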
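To make the effect concrete, here is a minimal sketch of the byte-to-character table that GPT-2 style byte-level BPE tokenizers use. It is only an illustration, not the code behind this Space: the helper names `to_byte_level` and `from_byte_level` are made up for this example, and it assumes the tokenizer in question follows the GPT-2 mapping.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table that maps each of the 256 byte values
    to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte gets a stand-in character starting at U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text: str) -> str:
    """Show text the way a byte-level BPE vocabulary stores it (hypothetical helper)."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token_text: str) -> str:
    """Map the stand-in characters back to bytes and decode them as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[c] for c in token_text)
    return raw.decode("utf-8", errors="replace")


word = "سلام"
garbled = to_byte_level(word)
print(garbled)                   # Latin-looking stand-ins, two per Arabic letter
print(from_byte_level(garbled))  # the original Arabic word again
```

Each Arabic letter takes two bytes in UTF-8, so a four-letter word turns into eight stand-in characters. Reversing the table recovers the original text, which is presumably what the option does for display. A token from a tokenizer that does not use this table may contain characters missing from `CHAR_TO_BYTE`, so the reverse step is not safe to apply blindly.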
💡 @MohamedRashad, why don't you make "Better tokenization for Arabic Text" checked by default?
@SaiedAlshahrani
Because this option sometimes introduces errors when tokenizers other than BPE are used, so I left it unchecked by default. BPE should be the special case, not the general one.
Makes sense! Great work, anyway :-)
@SaiedAlshahrani
Thank you
MohamedRashad
changed discussion status to
closed