How to use model across multiple GPUs
#28
by aswad546
Hello,
Thank you for sharing this model. This may be a basic question, but I have two A100 GPUs with 80 GB and 40 GB of VRAM respectively, and I want to use the model mainly for inference. I know the full 16-bit model won't fit on my setup, but there is a version available with 8-bit quantized weights. That model's parameters take about 95 GB, which should fit across both cards with a relatively small context window. What I'm unclear on is how to split the model across the two GPUs for inference, since it cannot fit on a single GPU in any scenario.
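Is something like the following the right direction? This is just a rough sketch of what I was thinking, assuming the 8-bit checkpoint can be loaded with transformers/accelerate and sharded with `device_map="auto"`; the model ID and the `max_memory` caps below are placeholders I made up, not values from this repo.

```python
# Tentative sketch: load the pre-quantized checkpoint and let accelerate
# split the layers across both GPUs, capping per-GPU memory so layers
# spill over to the second card instead of running out of memory on the first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model-id-here"  # placeholder for the actual 8-bit repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                    # shard layers across available GPUs
    max_memory={0: "75GiB", 1: "35GiB"},  # guessed headroom for the 80 GB / 40 GB cards
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```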
Any pointers would be appreciated.
Thank you!