aikitoria/Goliath-longLORA-120b-rope8-32k-exl2 · Thank you for posting memory usage

Thanks! I had calculated the exact bpw needed to hit these memory sizes in advance, and then the tests were to verify that it worked :)

You can do it for your own quants fairly easily:

Start with any size quant of the model size you want (i.e. any exl2 70b, or 120b, or whatever).

Load the model on an idle A100 80GB with zero context size. The value you get in nvidia-smi will be your base memory size, this scales pretty much linearly with the bpw of the quant.

Then load it again, but at the different useful context sizes and context bit widths you care about (i.e. 16k/32k x 8/16 bit).
The amount this uses above the base memory size is how much space the context takes up at that configuration, this remains unchanged for any bpw.

Then you can easily put it together: max bpw = (target mem - context size) / base size * base bpw.
Give it like 1GiB slack to account for allocation weirdness, only being able to split between GPUs at layer borders, and people running desktop environments.