Big jump between IQ2_XXS and IQ2_M - any chance of IQ2_XS quants?

#1 · opened by smcleod

Hey bartowski, love your work!

I noticed you're no longer doing IQ2_XS quants of 70B+ models. I totally get that you can only create so many quant types, but I was wondering if there was also a quality reason for this?

The reason I ask is that there's quite a gap between IQ2_XXS and IQ2_M, a range well suited to folks trying to fit ~65-75B models plus a decent context size into 32-40GB of VRAM.

[screenshot: SCR-20240614-iwdz.png, the spread of quant sizes and quality]

The way I read this spread (and I absolutely could be missing something) is that if I'm looking for the best combination of quality and context size:

  • IQ1_M is not really useful quality-wise and is basically the same size as IQ2_XXS (personally, I don't think they're worth creating).
  • IQ2_XXS is too large for folks with 24GB VRAM, yet quite a bit lower in quality than IQ2_M.
  • IQ2_M is almost the same size as IQ3_XXS (which is preferable quality-wise) and needs an additional 3.84GB of VRAM on top of IQ2_XXS (rough size arithmetic in the sketch after this list).
  • Q2_K might run slightly faster than IQ2_M on non-CUDA machines (although this has improved a lot with recent llama.cpp versions), but it's pretty much the same size as IQ3_XXS, which is considerably higher quality.
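
To put rough numbers on this, here's a quick back-of-the-envelope sketch in Python. The bits-per-weight figures are the approximate values llama.cpp quotes for each quant type, and the 72B parameter count and 6GB context/buffer overhead are my assumptions, so treat the output as an estimate, not exact file sizes:

```python
# Back-of-the-envelope: which quants of a ~72B model fit in a given
# VRAM budget once room is reserved for context? Bits-per-weight are
# the approximate figures llama.cpp quotes per quant type; real GGUF
# files mix tensor types, so these are estimates only.

PARAMS = 72e9       # parameter count (assumed 72B model)
OVERHEAD_GB = 6.0   # assumed KV cache + compute buffers; grows with context size

QUANT_BPW = {       # approximate bits per weight
    "IQ1_M":   1.75,
    "IQ2_XXS": 2.06,
    "IQ2_XS":  2.31,
    "IQ2_M":   2.70,
    "Q2_K":    3.00,
    "IQ3_XXS": 3.06,
}

def weights_gb(params: float, bpw: float) -> float:
    """Estimated size of the quantized weights in GB."""
    return params * bpw / 8 / 1e9

for name, bpw in QUANT_BPW.items():
    size = weights_gb(PARAMS, bpw)
    fits = [v for v in (24, 32, 40) if size + OVERHEAD_GB <= v]
    print(f"{name:8s} ~{size:4.1f} GB weights, fits (GB VRAM): {fits or 'none'}")
```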

It feels like the sweet spot for 72B might be IQ2_XS. Would it be worth dropping the IQ1_M quants and replacing them with IQ2_XS?

by bartowski

I could probably just add IQ2_XS, thanks for the feedback

I was trying to streamline the lineup. It would be nice if I could see per-quant download counts to tell whether anyone actually uses IQ1_M, but I imagine most of the sizes will have someone like you asking for them back, so I tried to strike a nice balance :D

IQ2_XS probably fills a pretty large gap, as you noted, so I'll look at putting it back into rotation
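
For anyone who wants to fill the gap locally in the meantime, here's a minimal sketch of how an IQ2_XS quant is typically produced with llama.cpp; this is an assumption about the usual tooling, not bartowski's actual pipeline, and all file names are made up:

```python
# Minimal sketch (not bartowski's actual pipeline) of producing an
# IQ2_XS quant with llama.cpp's llama-quantize tool. File names are
# hypothetical. Low-bit i-quants such as IQ2_XS expect an importance
# matrix, generated beforehand with the llama-imatrix tool and passed
# via --imatrix; older llama.cpp builds name the binary "quantize".
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "--imatrix", "model.imatrix",  # importance matrix for this model
        "model-F16.gguf",              # full-precision source GGUF
        "model-IQ2_XS.gguf",           # quantized output
        "IQ2_XS",                      # target quant type
    ],
    check=True,  # raise if llama-quantize exits non-zero
)
```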

Thanks a bunch, really appreciate your work.

smcleod changed discussion status to closed
