Weird token in the tokenizer?

#13
by Lambent - opened

I'm looking at the tokenizer.json and saw a strange thing at 922:

      "',": 920,
      "▁out": 921,
      "▁ا": 922,
      "block": 923,
      "ies": 924,
      "lay": 925,
      "▁his": 926,

Is this an error that's possibly causing other issues?

Also here:

      "ien": 1145,
      "IC": 1146,
      "▁ال": 1147,
      "▁/": 1148,
      "str": 1149,
      "▁mu": 1150,

(Manually looking through, not programmatically, so these are just examples)

Those are Arabic letters, and it could just be the formatting, which is why it looked that way to you.
Here, I'll try adding Arabic letters after this to see if we can replicate it: كما هكد_1147 like so. Try copying it and see.

Not sure if it's replicating, but I think I see what's going on. Mixing right-to-left and left-to-right text makes it look like the : is inside the quotes along with the number, but if I actually highlight it, the structure is not what it looks like.
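
For anyone who wants to check this programmatically instead of by eye, here is a minimal sketch. It assumes a small hand-written JSON excerpt that mirrors the entries quoted above (in a real tokenizer.json the vocab would presumably sit under something like `model.vocab`, which is an assumption here, not something confirmed in this thread). The names `snippet`, `has_rtl`, and `rtl_tokens` are made up for illustration. The tokens are written with `\u` escapes so the editor cannot reorder them visually.

```python
import json
import unicodedata

# Hypothetical excerpt mirroring the entries quoted in this thread.
# \u2581 is the "▁" marker, \u0627 is Arabic alef, \u0644 is Arabic lam.
snippet = (
    '{"\u2581out": 921, "\u2581\u0627": 922, "block": 923, '
    '"\u2581\u0627\u0644": 1147, "\u2581/": 1148}'
)
vocab = json.loads(snippet)


def has_rtl(token: str) -> bool:
    """Return True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


# Keep only the tokens that contain right-to-left characters.
rtl_tokens = {tok: idx for tok, idx in vocab.items() if has_rtl(tok)}

for tok, idx in rtl_tokens.items():
    # Print code points rather than the raw token so the terminal
    # cannot visually reorder the key and the number.
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in tok)
    print(f"id={idx} codepoints={codepoints}")
```

After parsing, each key is a plain string and each value a plain integer, so the odd-looking lines are only a display effect. Printing code points instead of the raw token sidesteps the bidirectional reordering entirely.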

Google org

Wow TIL 🐣 did not know this could show up like this in the vocab!

Google org

thanks for sharing the finding! :)
