Change from 2.6 to 2.55bpw
Firstly, thanks for all the exl2 conversions.
But can I ask why you dropped your regular output from 2.6bpw to 2.55bpw?
The difference may well be insignificant, but I always found 2.6bpw to be a sweet spot on my setup (28GB VRAM).
I've had mixed messages from folks requesting different bpw. I believe 2.55bpw was enough to get some folks extra context length. 2.6bpw was originally selected because that was what Turboderp said would fit on a single 24GB VRAM card. I can switch back to 2.6bpw.
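For a rough sense of the numbers, here is a back-of-the-envelope sketch. It assumes a 70B-class model (my assumption, since that's the usual target for 2.4-2.6bpw quants) and only counts the weights:

```python
# Rough VRAM estimate for exl2 quant sizes (weights only).
# Assumes a 70B-parameter model; KV cache and CUDA overhead are extra.
N_PARAMS = 70e9

for bpw in (2.4, 2.55, 2.6):
    weight_gib = N_PARAMS * bpw / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{bpw:4}bpw -> ~{weight_gib:.1f} GiB of weights")

# Prints roughly:
#  2.4bpw -> ~19.6 GiB of weights
# 2.55bpw -> ~20.8 GiB of weights
#  2.6bpw -> ~21.2 GiB of weights
```

Whatever is left over on a 24GB card goes to the KV cache and overhead, so even 0.05bpw can translate into a noticeable amount of extra context.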
If more people are requesting 2.55bpw, then stick with it. Though I would think that 2.4 is a better fit for those with 24GB VRAM, depending on what other programs they have open that are using VRAM.
Whatever you decide, I'm very grateful for your quants.
I'm currently driving the Nous-Capybara-34B-5.0bpw that you posted and finding it very responsive and coherent, though for some reason the Alpaca Instruct prompt format is working better than Vicuna via the SillyTavern (ST) frontend.
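(For anyone unfamiliar, the two prompt formats in question look roughly like this. Alpaca:

```
### Instruction:
{prompt}

### Response:
```

versus Vicuna:

```
USER: {prompt}
ASSISTANT:
```

Exact spacing varies between presets.)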
Anyway, take care.
Things are getting even a bit more complicated. If you enable the cache_8bit option for the ExLlamav2 loader, you can fit a higher-bpw quant in the same VRAM, since the 8-bit KV cache takes roughly half the space of the default FP16 cache; supposedly this only trades a little inference speed for the VRAM savings, with no degradation of quality.
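If you're loading through the exllamav2 Python API rather than a frontend, a minimal sketch looks something like this (the model path is a placeholder, and this mirrors the library's example loading pattern as I understand it):

```python
# Minimal sketch: load an exl2 quant with the 8-bit KV cache.
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache_8bit,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/your-model-exl2-2.6bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)

# 8-bit cache: stores keys/values in FP8 instead of FP16,
# roughly halving the VRAM the cache needs.
cache = ExLlamaV2Cache_8bit(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Hello!", settings, num_tokens=64))
```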
We may have to recalibrate what the best bpw settings are, both with and without the cache_8bit option.