Something is wrong with the GPT-4 tokenizer

#7
by mahnerak - opened

Minimal example: "8274876".

Screenshot 2024-04-28 at 23.45.02.png

Screenshot 2024-04-28 at 23.44.51.png

Note: cl100k_base has a pre-tokenization regex that splits every run of digits into chunks of at most three digits. Does the JS version implement it?
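To make the expected behavior concrete, here is a minimal sketch of the digit-splitting step alone. It assumes only the digit branch of the cl100k_base split pattern (`\p{N}{1,3}`), not the full pattern, and it uses `\d` as a stand-in because Python's built-in `re` module does not support `\p{N}`. The helper name `split_digits` is hypothetical and is not part of any tokenizer library.

```python
import re

# Stand-in for the digit branch of the cl100k_base split pattern (\p{N}{1,3}).
# Python's built-in `re` has no \p{N}, so \d is used here as an approximation.
DIGIT_CHUNK = re.compile(r"\d{1,3}")


def split_digits(text: str) -> list[str]:
    """Hypothetical helper: chunk digit runs, left to right, into pieces of at most three digits."""
    return DIGIT_CHUNK.findall(text)


print(split_digits("8274876"))  # ['827', '487', '6']
```

If the JS tokenizer applies the same rule, the minimal example should be pre-split into `827`, `487`, and `6` before any BPE merges happen.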

Owner

Hi there. Which tokenizer are you using? It seems to be working fine on my side. :)

image.png

We updated tokenizer.json over two months ago (see commit), which may have affected this, so you may need to clear the cache if the old version is still stored there. For example, in Chrome:

  1. Open dev tools (F12)
  2. Go to the Application tab
  3. Expand the "Cache storage" dropdown
  4. Right-click "transformers-cache" and click "Delete"

Let me know if that helps!

Thanks!
Even after clearing everything (from the Application tab), the problem persisted :/

But I can confirm that everything works well on other devices and in Incognito mode.

mahnerak changed discussion status to closed
