Is it possible to train this model on a commercially available cloud machine?

#19
by Walexum - opened

I've been trying to train with Hugging Face Accelerate on a machine with 4 A100s (80 GB each) and I keep getting a CUDA out of memory error. I've tried implementing every optimization I can find. Is this actually possible?

You probably do not have enough memory to train a model this big. Once you account for optimizer state, I'd expect 8x 80GB A100s to be the minimum. You can probably do PEFT though: https://github.com/huggingface/peft
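To make the optimizer-state point concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam, which typically costs about 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments). The parameter count is a made-up placeholder, not this model's actual size, and `training_state_gb` and `fits` are just illustrative helper names. Activations and framework overhead are ignored, so real usage will be higher.

```python
# Rough per-parameter memory cost of full fine-tuning with Adam in mixed precision.
# Activations, temporary buffers, and fragmentation are ignored, so this is a lower bound.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def training_state_gb(num_params: float) -> float:
    """Weights + gradients + optimizer state, in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def fits(num_params: float, num_gpus: int, gb_per_gpu: float = 80.0) -> bool:
    """True if the training state alone fits when sharded evenly across all GPUs."""
    return training_state_gb(num_params) <= num_gpus * gb_per_gpu


if __name__ == "__main__":
    hypothetical_params = 30e9  # assumption for illustration only
    print(f"{training_state_gb(hypothetical_params):.0f} GB of training state")
    print("fits on 4x A100 80GB:", fits(hypothetical_params, 4))
    print("fits on 8x A100 80GB:", fits(hypothetical_params, 8))
```

Plug in the real parameter count to see where you land. With PEFT, only the small adapter weights need gradients and optimizer state, which is why it can fit where full fine-tuning does not.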
