Any chance for an EXL2 version?

Opened by @wolfram

@LoneStriker and @Panchovix quantized EXL2 versions of Goliath 120B. Any chance we could get an EXL2 version of this model, too? It's tuned on top of Goliath and in my tests it's also at the top of my rankings!

I can put it in the queue. 120Bs are monsters to chew through though.

If LoneStriker takes it into his queue, then nice! And I can confirm, it takes a good while to do an EXL2 quant of a 120B model.

Thanks to both of you! 120B is definitely a beast, but even down to 3-bit it still beats all the 70Bs, and with EXL2 it runs nicely fast, too. So looking forward to this, thanks a lot!

Various bit rates should appear here (3.0 and 4.5 are done, other variants still quantizing):
https://huggingface.co/models?search=lonestriker%20tess-xl

Awesome job! Can we also get a sample script on how to run it with EXL2?

Most people will just run it under ooba's text-generation-webui using the exllamav2 loader. The one setting you may need to change, however, is to uncheck this option:

[screenshot of the relevant ooba loader setting]

For low-bit-rate quants, the model will spit out gibberish if this option is left checked (it's on by default).

If you want to use it programmatically in Python, you can use Turboderp's exllamav2 project. Script here:
https://github.com/turboderp/exllamav2/blob/master/examples/chat.py
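
For reference, here is a minimal sketch of what programmatic use looks like, loosely following exllamav2's basic inference example; the model path and sampling values are placeholders, and the exact API surface may shift between versions:

```python
# Minimal exllamav2 inference sketch (model path and sampling values are placeholders).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Tess-XL-v1.0-3.0bpw-h6-exl2"  # local dir with the EXL2 quant
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can place it per GPU
model.load_autosplit(cache)               # spread layers across all visible GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

prompt = "SYSTEM: You are Tess, a helpful assistant.\nUSER: Hello!\nASSISTANT:"
print(generator.generate_simple(prompt, settings, 200))  # generate up to 200 new tokens
```

The linked chat.py does essentially this, plus prompt formatting and streaming output.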

You can run the 3.0bpw quant on 2x 3090s/4090s at a reasonably fast inference speed.

What I do for easier testing is use exui (from the same developer as exllamav2): after downloading the model, I just load it directly with that UI.

https://github.com/turboderp/exui

[screenshot of exui showing the speculative decoding settings]

You can also use the original exllamav2 project directly, e.g. to run some benchmarks, or install it from source to use in other backends: https://github.com/turboderp/exllamav2

exui is recommended (though very few people know about it yet). One benefit is that you can use speculative decoding, as @Panchovix shows above. You basically get a 50-100% speedup for a small amount of extra VRAM to run the draft model.
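
For anyone curious what speculative decoding looks like outside a UI, here is a rough sketch using exllamav2's streaming generator with a separate draft model; the paths are placeholders and the constructor arguments are from memory of the examples, so double-check them against your installed version:

```python
# Speculative decoding sketch: a small draft model proposes several tokens per step
# and the big model verifies them in a single forward pass.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir):
    cfg = ExLlamaV2Config()
    cfg.model_dir = model_dir
    cfg.prepare()
    model = ExLlamaV2(cfg)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
    return model, cache, cfg

model, cache, cfg = load("/models/Tess-XL-3.0bpw-exl2")      # main model (placeholder path)
draft, draft_cache, _ = load("/models/TinyLlama-1.1B-exl2")  # small Llama-vocab draft model

tokenizer = ExLlamaV2Tokenizer(cfg)
generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft, draft_cache=draft_cache,
    num_speculative_tokens=5,  # draft tokens proposed per verification step
)

settings = ExLlamaV2Sampler.Settings()
generator.begin_stream(tokenizer.encode("Write a haiku about GPUs."), settings)

text = ""
for _ in range(200):
    chunk, eos, _ = generator.stream()
    text += chunk
    if eos:
        break
print(text)
```

The draft model has to share the main model's vocabulary for the verified tokens to line up, which is why a small Llama-family model is the usual choice here.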

The 3.0bpw quants have been saving my butt. They give normal 70B speeds and what feels like 80-90% of the quality of a Q4_K_S/Q4_K_M GGUF. Only 3500 or so tokens of context, though.

Wow, that was faster than I expected. Thanks a lot, @LoneStriker !

And thanks for the recommendation of exui, @Panchovix - speculative decoding sounds interesting and useful. But that UI doesn't have an API, does it? My frontend is SillyTavern, so I need a backend that can be used with it, either OpenAI-API-compatible or e.g. ooba's text-generation-webui (which is now OpenAI-API-compatible, too).

@wolfram in that case I suggest using tabbyAPI: https://github.com/theroyallab/tabbyAPI

It is a very lightweight API server that loads exllamav2/GPTQ models, and it will work with ST (SillyTavern).

Though it isn't as easy as ooba.
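
Once tabbyAPI (or ooba) is running, SillyTavern just needs the base URL. You can also sanity-check the endpoint with a few lines of Python; the port, API key, and model name below are assumptions from my setup, so adjust them to yours:

```python
# Quick smoke test against an OpenAI-compatible backend (tabbyAPI, ooba, etc.).
import requests

BASE_URL = "http://127.0.0.1:5000/v1"  # assumed host/port - check your backend's config
API_KEY = "your-api-key-here"          # tabbyAPI generates a key; some backends ignore it

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "Tess-XL-3.0bpw-exl2",  # whatever name the backend reports for the loaded model
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```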

@Panchovix Why not ooba? Does tabbyAPI support speculative decoding or what would be the advantage?

ooba does not support exl2 speculative decoding. exui and tabbyAPI both support it.

3.0bpw barely fits in 48 GB. Even the tiniest draft model won't fit alongside it.

Yup, 3.0 is a tight fit, even with 8-bit cache enabled. If you want to go lower, grab a 2.18, 2.4 or 2.85 bpw model:
https://huggingface.co/LoneStriker?search_models=tess-xl
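
For anyone wondering why 3.0bpw is right at the edge of 48 GB, a rough back-of-envelope helps (treating Goliath-class models as roughly 118B parameters; this counts weights only, so the KV cache and activation buffers come on top):

```python
# Back-of-envelope VRAM estimate for EXL2 quants: params * bits-per-weight / 8.
# Weights only - the context cache and activation buffers are extra.
def weight_gib(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 2**30

for bpw in (2.4, 2.85, 3.0, 4.5):
    print(f"{bpw:>4} bpw: ~{weight_gib(118, bpw):5.1f} GiB of weights")
# 3.0 bpw lands around 41 GiB of weights on a ~118B model, so 2x 24 GB cards
# leave only a few GiB for the cache - hence the ~3500-token context ceiling.
```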
