Is there a best way to infer this model from multiple small memory GPUs? #39
by hongdouzi - opened
I have four 3090s with 96 GB of total VRAM. Which framework should I use to run inference on this model most efficiently?
vLLM or Aphrodite Engine: load a 4-bit quant with tensor parallelism across the four cards and a 64k context. A rough sketch is below.
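With vLLM, that setup could look roughly like this. The model ID is a placeholder, and the `quantization="awq"` flag assumes you are loading an AWQ 4-bit quant; a GPTQ quant would use `quantization="gptq"` instead. Treat this as a minimal sketch, not a tuned config:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="someorg/some-model-AWQ",  # placeholder: substitute the actual 4-bit quant repo
    quantization="awq",              # assumes an AWQ 4-bit quant; adjust to match yours
    tensor_parallel_size=4,          # shard the weights across the four 3090s
    max_model_len=65536,             # the 64k context suggested above
    gpu_memory_utilization=0.90,     # leave a little headroom per card for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

If the KV cache at 64k doesn't fit alongside the weights, lowering `max_model_len` or `gpu_memory_utilization` is the usual first lever.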