tokenization_yi.py

#1
by Stilgar - opened

Got the following error with oobabooga and trust_remote_code enabled:

OSError: models\LoneStrikerdolphin-2.2-yi-34b-4.0bpw-h6-exl2 does not appear to have a file named tokenization_yi.py. Checkout 'https://huggingface.co/models\LoneStrikerdolphin-2.2-yi-34b-4.0bpw-h6-exl2/None' for available files.

Looks like the model was missing the file. Fixed.

Thank you, that error is gone.

But now I get another one, probably because my transformers version is not the latest:

ValueError: Tokenizer class YiTokenizer does not exist or is not currently imported.

Edit: After trying to import the missing YiTokenizer, I get the following error:

ModuleNotFoundError: No module named 'transformers_modules.LoneStrikerdolphin-2'

Maybe another file is missing...

Stilgar changed discussion status to closed

Using --loader exllamav2 instead of --loader exllamav2_hf resolves this issue.
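For anyone launching text-generation-webui from the command line rather than the UI, the switch looks roughly like this (the model folder name is taken from the error above; exact paths on your system may differ):

```shell
# exllamav2_hf goes through transformers' tokenizer loading and hits the
# YiTokenizer error:
python server.py --loader exllamav2_hf --model LoneStrikerdolphin-2.2-yi-34b-4.0bpw-h6-exl2

# the plain exllamav2 loader uses its own tokenizer handling and works:
python server.py --loader exllamav2 --model LoneStrikerdolphin-2.2-yi-34b-4.0bpw-h6-exl2
```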

Yes, you're right, it works with exllamav2.
The model uses 23.5 GB of VRAM and is really fast on a 4090 (roughly 34 tokens/s).
BUT:
it hallucinates a lot and does not answer the questions asked.
Ultimately the result is worse than a Mistral 7B.
Unfortunately, exl2 is known to favor speed to the detriment of accuracy.

I don't believe that's a true statement. exl2 is as accurate as other quantization methods while also being fast. It also lets you pick exactly the bitrate you want, so you can fit models into specific GPUs (a 70B at ~2.4bpw fits on a single 3090 or 4090). The Yi models themselves seem to be much more fragile and prone to go off the rails. Try setting your repetition penalty lower (like 1.0, or close to it). At lower bitrates, exl2 models need BOS token insertion turned off in ooba; you can try turning that off here as well.
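To see why a repetition penalty of 1.0 helps: at 1.0 the penalty is a no-op, while higher values push every already-generated token's score down, which can derail fragile models. A minimal sketch of the CTRL-style penalty that HF-based samplers apply (the function name and plain-list logits are illustrative, not the actual ooba API):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.0):
    """CTRL-style repetition penalty: shrink the score of every token
    that has already appeared in the output. penalty == 1.0 is a no-op."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty   # positive score: divide to lower it
        else:
            out[tok] *= penalty   # negative score: push it further down
    return out

# penalty=1.0 leaves the distribution untouched; penalty>1.0 demotes
# tokens 0 and 1, which were already generated
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 1.0))  # [2.0, -1.0, 0.5]
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], 2.0))  # [1.0, -2.0, 0.5]
```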

Setting the repetition penalty to 1 seems better; the temperature parameter does not change much.

The thing I notice:
If I ask the model in French, I get some big mistakes:

For example, I asked about the constitution of the atom, and it mixed up neutrons and neutrinos, then assigned a null charge to the electron.
Asking in English (with the same parameters), I got different sentences and a good result.

Similarly, I asked in French about a character from a well-known book, and the reply was that the character does not exist…
In English, the character is found; the story has some errors, but at least the reply is not too bad and is coherent.

I did not notice this with other models; generally the French and English answers are very close (only occasionally is a word not fully translated).

In any case, thank you for your work. I'll try more exl2 models, because the speed and VRAM usage are clearly very efficient.
