UnicodeDecodeError with ooba's Transformers loader

#4
by wolfram - opened

Hi @ehartford ,

when trying to load this model with ooba's Transformers loader, it only produces this traceback:

[Screenshot of the traceback: 2023-12-29 20_05_30-Text generation web UI.png]

I've redownloaded the model to make sure it didn't get corrupted during the download. I've also reinstalled ooba and made sure it's fully updated. I even manually updated the Transformers library. It's still always the same problem.

Any idea what could be the cause and how to solve it?

It's the vocab section in tokenizer_config.json.

So there seems to be something in there that causes problems on Windows. Until this gets fixed in the file, here's a workaround that was mentioned on Reddit (https://www.reddit.com/r/LocalLLaMA/comments/18tiin9/dolphin_26_mistral_7b/kfgn6kv/):

Within the Control Panel, navigate to the [Clock and Region] section and click on [Region]. In the [Region] window that appears, select the [Administrative] tab and click "Change system locale...". In the dialog that opens, enable the option labelled "Beta: Use Unicode UTF-8 for worldwide language support". Reboot not required, just restart ooba.

This should probably be only a temporary workaround as there might be side effects (https://stackoverflow.com/questions/56419639/what-does-beta-use-unicode-utf-8-for-worldwide-language-support-actually-do).
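For anyone wondering why the system locale matters here: the sketch below shows the general mechanism, not the actual contents of this model's file. It assumes the loader reads the file with Python's locale-dependent default encoding, which is cp1252 on many Windows setups. The sample string and the helper name `try_decode` are made up for illustration.

```python
import locale

# Hypothetical sample: a JSON fragment containing a right double quotation
# mark (U+201D). Its UTF-8 encoding is E2 80 9D, and byte 0x9D has no
# mapping in Python's cp1252 codec.
sample_bytes = '{"note": "curly \u201d quote"}'.encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"failed at byte offset {err.start}: 0x{data[err.start]:02x}"


if __name__ == "__main__":
    # What open() falls back to when no encoding= argument is given.
    print("default text encoding:", locale.getpreferredencoding(False))
    print("utf-8 :", try_decode(sample_bytes, "utf-8"))
    print("cp1252:", try_decode(sample_bytes, "cp1252"))
```

With the UTF-8 option enabled, the default encoding reported on the first line should change to UTF-8, which would explain why the workaround helps. Code that passes `encoding="utf-8"` to `open()` explicitly does not depend on the locale at all.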

Can we just resave the JSON as Unicode?

@FPHam That was the first thing I tried - resaving tokenizer_config.json (which VSCode recognized as UTF-8) as UTF-8 with BOM, UTF-16 LE, and Western (Windows 1252) since the decoding error appears in cp1252.py. Nothing except the aforementioned workaround worked for me.
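One variation that differs from changing the file's encoding is to rewrite the JSON so it contains no non-ASCII bytes at all, which makes it decode identically under UTF-8 and cp1252. This is only a sketch and untested against this model: it assumes the loader parses the file as standard JSON (so escaped and literal characters are equivalent) and that this file is really the one that fails to decode. The function name `resave_ascii_only` is hypothetical.

```python
import json
import shutil
from pathlib import Path


def resave_ascii_only(path: str) -> None:
    """Rewrite a UTF-8 JSON file so that non-ASCII characters are escaped."""
    p = Path(path)
    # Keep a copy of the original next to it, e.g. tokenizer_config.json.bak
    shutil.copy2(p, p.with_name(p.name + ".bak"))
    data = json.loads(p.read_text(encoding="utf-8"))
    # ensure_ascii=True turns every non-ASCII character into a \uXXXX escape,
    # so the resulting file consists of ASCII bytes only.
    p.write_text(json.dumps(data, ensure_ascii=True, indent=2), encoding="ascii")


if __name__ == "__main__":
    target = Path("tokenizer_config.json")  # assumed location, adjust as needed
    if target.exists():
        resave_ascii_only(str(target))
```

The parsed content stays the same; only the on-disk representation of the special characters changes.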

What exactly is at position 2149 that's causing such an issue? (And why the hell does Windows 11 still have to deal with native code pages like cp1252 instead of being all-Unicode by now?!)
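To check what sits at the failing position, something like the following sketch could be run against the raw bytes of the file. The helper name `describe_decode_failure` and the file path are assumptions. Note also that the position in the traceback refers to the buffer being decoded, so it only matches the file offset if the file was decoded in one piece.

```python
from pathlib import Path


def describe_decode_failure(data: bytes, encoding: str = "cp1252", context: int = 20):
    """Return (offset, byte value, nearby text) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as err:
        lo = max(0, err.start - context)
        hi = min(len(data), err.start + context)
        # Show the neighbourhood as UTF-8; characters cut off at the edges
        # of the slice are replaced instead of raising another error.
        snippet = data[lo:hi].decode("utf-8", errors="replace")
        return err.start, data[err.start], snippet


if __name__ == "__main__":
    target = Path("tokenizer_config.json")  # assumed location, adjust as needed
    if target.exists():
        print(describe_decode_failure(target.read_bytes()))
```

The returned snippet shows the surrounding text decoded as UTF-8, which should make the offending character visible.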

I think it was some weird ' character, as I remember... Yes, it's always a mess with this in Windoze: 20 years on, still a mess.
