Add support for EXL2 4 bit KV cache; switch from metric gigabytes (1e9 bytes) to JEDEC gigabytes (2^30 bytes)

#2

I'm sorry for mashing 2 unrelated issues together in one PR.

  1. I added 4-bit EXL2 cache support by replacing the 8-bit cache checkbox with a drop-down that defaults to 16-bit and also offers 8-bit and 4-bit. The calculation now uses an integer bit width instead of a conditional statement.

  2. (Concerns lines 168-170 only) Your calculator appeared to overestimate memory use because it used metric gigabytes, equal to 1e9 bytes, whereas VRAM is measured in JEDEC Standard 100B.01 gigabytes, equal to 2^30 bytes. An RTX 4090 has 24 GB = 25.77e9 B of memory. This 7.4% difference may seem insignificant, but it matters when figuring out how big a model you can squeeze onto your GPU. For instance, 22.5 GB is equal to 24.16e9 B. The first number suggests that the model will fit in 24 GB of VRAM; the second implies it won't.
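
To illustrate both changes, here is a minimal Python sketch of the arithmetic. It is not the calculator's actual code: the function names, the parameter names, and the KV cache formula (keys plus values, per layer, per KV head) are assumptions for illustration, and it ignores any extra storage a quantized cache may need for scales.

```python
GIB = 2**30   # JEDEC gigabyte, the unit VRAM sizes are quoted in
GB_METRIC = 10**9  # metric gigabyte, the unit the calculator used before


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_bits=16):
    """Estimate KV cache size in bytes (hypothetical helper).

    cache_bits is a plain integer (16, 8 or 4), so no conditional is needed
    to pick between cache precisions.
    """
    # Factor 2: one tensor for keys and one for values.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * cache_bits // 8


def bytes_to_jedec_gb(n_bytes):
    """Convert bytes to JEDEC gigabytes (2^30 bytes)."""
    return n_bytes / GIB


def bytes_to_metric_gb(n_bytes):
    """Convert bytes to metric gigabytes (1e9 bytes)."""
    return n_bytes / GB_METRIC


if __name__ == "__main__":
    # The same number of bytes reads differently in the two units.
    model_bytes = int(22.5 * GIB)
    print(f"metric: {bytes_to_metric_gb(model_bytes):.2f} GB")
    print(f"JEDEC:  {bytes_to_jedec_gb(model_bytes):.2f} GB")

    # Example shape (assumed values): 32 layers, 8 KV heads, head_dim 128.
    for bits in (16, 8, 4):
        size = kv_cache_bytes(32, 8, 128, 8192, cache_bits=bits)
        print(f"{bits}-bit cache: {bytes_to_jedec_gb(size):.2f} GB")
```

With these assumed model dimensions, halving the cache bit width halves the estimated cache size, and the unit conversion alone accounts for the roughly 7.4% gap described above.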

LGTM, thanks!

NyxKrage changed pull request status to merged
