How to use model across multiple GPUs
#28
by aswad546
Hello,
Thank you for sharing this model. This may be a basic question, but I have two A100 GPUs with 80 GB and 40 GB of VRAM respectively, and I want to use the model mainly for inference. I know the full 16-bit model won't fit on my setup, but there is a version available with 8-bit quantized weights. That model's parameters take about 95 GB, which should fit across both cards with a relatively small context window. What I'm unclear on is how to split the model across the two GPUs for inference, since it cannot fit on a single GPU in any scenario.
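Is something like the following the right direction? This is just a rough sketch of what I was thinking, assuming the 8-bit checkpoint can be loaded with transformers/accelerate and sharded with `device_map="auto"`; the model ID and the `max_memory` caps below are placeholders I made up, not values from this repo.

```python
# Tentative sketch: load the pre-quantized checkpoint and let accelerate
# split the layers across both GPUs, capping per-GPU memory so layers
# spill over to the second card instead of running out of memory on the first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model-id-here"  # placeholder for the actual 8-bit repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                    # shard layers across available GPUs
    max_memory={0: "75GiB", 1: "35GiB"},  # guessed headroom for the 80 GB / 40 GB cards
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```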
Any pointers would be appreciated.
Thank you!