Running on 3x 24 GB VRAM?


Hello!
I would like to know whether it is possible to run this model on a server with 3x RTX 4090. Of course, the model must either be prepared to split off parts of the computation that do not depend on results being computed simultaneously on the other GPUs, or its layers must be divided into three parts so that the intermediate result from cuda:0 is sent on to cuda:1 for further computation, and so on. Since I was unable to find any information about this, I assume it is not possible at the moment. Are there plans to offer this? I know there is a branch that lets the model run on a single 24 GB graphics card, but I expect that will cost some output performance.
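Just to make concrete what I have in mind: if the usual transformers/accelerate path applies to this model at all, the split across the three cards would presumably look something like the sketch below. The checkpoint name and the per-GPU memory caps are only assumptions on my side; I have not been able to verify any of this.

```python
from transformers import AutoModelForCausalLM

# Checkpoint name is an assumption (the 4-bit variant of this repo); the memory
# caps leave ~2 GB of headroom per RTX 4090 for activations and the KV cache.
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/c4ai-command-r-plus-4bit",
    device_map="auto",                                # let accelerate split the layers
    max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB"},  # one entry per GPU
)
```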

Best regards
Marc

P.S.: Very impressed by this work!!

At least I tried with 4x L4 GPUs (i.e. 96 GB VRAM) and it didn't work. I got an out-of-memory error with this 4-bit version.

@BrunoSE From my research so far, this might be a working solution (a rough sketch of how I would try it follows below):
https://huggingface.co/pmysl/c4ai-command-r-plus-GGUF in combination with https://github.com/ggerganov/llama.cpp

Another option seems to be https://github.com/ollama/ollama/releases/tag/v0.1.32-rc1
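In case the llama.cpp route is the right one, something like the sketch below is what I have in mind. It uses the llama-cpp-python bindings rather than the C++ CLI from the link above, and the GGUF filename, layer offload, and context size are assumptions on my part; I have not run this myself.

```python
from llama_cpp import Llama

# Path/filename are assumptions; point this at whichever quantization of the
# pmysl/c4ai-command-r-plus-GGUF files you downloaded.
llm = Llama(
    model_path="./c4ai-command-r-plus-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload as many layers as possible to the GPUs
    n_ctx=4096,       # context window; raise if VRAM allows
)

output = llm("Hello, how are you?", max_tokens=64)
print(output["choices"][0]["text"])
```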

This way of running an LLM on local (consumer) hardware is new to me (as it is for you, I think ;), so I was hoping to get some input here.

Best regards
Marc

> At least I tried with 4x L4 GPUs (i.e. 96 GB VRAM) and it didn't work. I got an out-of-memory error with this 4-bit version.

Strange that it didn't work for you. I was able to get the 4-bit version working on four A10G cards totaling 96 GiB of VRAM. I didn't do anything special, just loaded the model with AutoModelForCausalLM.from_pretrained(). Note that passing device_map='auto' is important so that all the GPUs are utilized. However, I am getting OOM errors at only moderately long context lengths of around 4k tokens.
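For reference, the load was roughly the snippet below; the model ID and the prompt are placeholders rather than my exact script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID is an assumption; use the 4-bit checkpoint you are working with.
model_id = "CohereForAI/c4ai-command-r-plus-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the layers across all visible GPUs (here, 4x A10G).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("How can I run a large model on several GPUs?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The OOM at ~4k tokens is presumably the KV cache growing during generation; capping each card with max_memory to leave more headroom might help, but I have not tested that.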
