Evenly distribute the model across GPUs?

#32 opened by Shiba

I am using 4 A100s (the 40 GB version) for inference. However, some GPUs run out of CUDA memory after 3 or 4 generations. Do you have any suggestions for fixing this?
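For reference, a minimal sketch of one way to spread the model evenly across the four cards with Accelerate's `device_map`, while capping per-GPU weights so activations and the KV cache have headroom (a common cause of OOM that appears only after a few generations). This assumes the model loads via `transformers`; the model ID below is a hypothetical placeholder, not the checkpoint from this thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,            # halve the footprint vs. fp32
    device_map="balanced",                # spread layers evenly across GPUs
    # Cap weights at 30 GiB per 40 GB card so activations and the KV cache,
    # which grow with each generation, don't push a GPU over the limit.
    max_memory={i: "30GiB" for i in range(4)},
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The key knobs here are `device_map="balanced"` (versus `"auto"`, which can pack earlier GPUs more heavily) and `max_memory`, which reserves slack on every card rather than letting the weights fill it.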
