Possible to do inference on long contexts with limited VRAM?
I'm doing inference with command-r-plus-4bit on four A10G GPUs, 96GiB of VRAM in total. It works fine, but at around a 4k context size I get an OutOfMemoryError. This is with device_map='auto', which leaves a few GiB of VRAM free on each GPU.
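For reference, the loading code looks roughly like this (the model id, dtype, and generation settings below are placeholders for my actual script, not exact copies):

```python
# Rough sketch of the setup described above; model id and generation
# parameters are assumptions, not exact copies of my script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-plus-4bit"  # assumed pre-quantized 4-bit repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # shards the weights across all four A10Gs
    torch_dtype=torch.float16,  # dtype for the non-quantized modules (assumption)
)

prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```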
My understanding is that generation puts extra load on GPU 0, and that device_map='balanced_low_0' therefore fills the other GPUs first and leaves GPU 0 with free VRAM. I tried this and confirmed that GPUs 1-3 were filled close to their limit while GPU 0 was practically unused. However, with this configuration inference doesn't work at all: I get an OutOfMemoryError for any prompt, long or short, even "Hello" (which normally produces a response of just a few tokens). Am I doing something wrong with this configuration?
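An alternative I've been looking at is passing an explicit max_memory map so that device_map='auto' leaves headroom on every GPU instead of relying on 'balanced_low_0'. The per-GPU caps below are just illustrative guesses for 24GiB A10Gs, not measured values:

```python
# Cap how much of each device 'auto' is allowed to fill; the limits here
# are illustrative guesses, not tuned numbers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-plus-4bit",   # assumed model id
    device_map="auto",
    max_memory={0: "16GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "64GiB"},
)
# 'auto' then fills each GPU only up to its cap, leaving room on every
# device for the KV cache and generation buffers.
```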
How much VRAM would I need just to do inference on 8k context?
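For a rough estimate, this is the back-of-envelope KV-cache calculation I've been using, with the architecture numbers read from the model's config rather than hard-coded (fp16 cache and batch size 1 assumed; it ignores activations, generation buffers, and the logits tensor):

```python
# Back-of-envelope KV-cache size for an 8k context, reading layer/head
# counts straight from the model config so nothing is hard-coded.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CohereForAI/c4ai-command-r-plus-4bit")  # assumed id

seq_len = 8192
bytes_per_elem = 2  # fp16 KV cache
n_layers = cfg.num_hidden_layers
n_kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
head_dim = cfg.hidden_size // cfg.num_attention_heads

# Two tensors (K and V) per layer, one head_dim vector per KV head per token.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache at {seq_len} tokens: ~{kv_bytes / 2**30:.1f} GiB (batch size 1)")
```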
Can this be used in Ollama?