Strange tokens

#11
by Chris4K - opened

In the vocab https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer.json
I see:

  "ք": 1239,
  "־": 1240,
  "א": 1241,

  "ת": 1267,
  "،": 1268,
  "ء": 1269,
  "ا": 1270,

....

  "ی": 1309,
  "ے": 1310,
  "अ": 1311,
  "आ": 1312,

I wonder why this is done, and what effect it has.

Maybe someone knows. The same kind of entries seem to show up in other vocabs too.
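For anyone who wants to check what these entries actually are, here is a minimal sketch using Python's standard `unicodedata` module. The `sample_vocab` dict is just the entries quoted above, copied by hand, and `script_of` is a helper name made up for this example. Taking the first word of the Unicode character name as the script is a rough heuristic that works for these characters, not a general script detector.

```python
import unicodedata

# Entries copied by hand from the vocab excerpt above (token -> id).
sample_vocab = {
    "ք": 1239,
    "־": 1240,
    "א": 1241,
    "ת": 1267,
    "،": 1268,
    "ء": 1269,
    "ا": 1270,
    "ی": 1309,
    "ے": 1310,
    "अ": 1311,
    "आ": 1312,
}


def script_of(token: str) -> str:
    """Hypothetical helper: first word of the Unicode name of the first character.

    For the characters in this sample that first word is the script name.
    """
    return unicodedata.name(token[0], "UNKNOWN").split()[0]


# Map each token to its (heuristic) script.
scripts = {token: script_of(token) for token in sample_vocab}

# Print id, code point, and full Unicode name for each entry.
for token, token_id in sample_vocab.items():
    code_point = f"U+{ord(token[0]):04X}"
    print(token_id, code_point, unicodedata.name(token[0], "UNKNOWN"))
```

Going by the Unicode names, the quoted entries are single characters from the Armenian, Hebrew, Arabic, and Devanagari blocks. That only identifies what they are; it does not answer why they are in the vocab.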

...
Christof
