Is it possible to train this model on a commercially available cloud machine?
#19, opened by Walexum
I've been trying to train with Hugging Face Accelerate on a machine with 4 A100s (80 GB each), and I keep getting CUDA out-of-memory errors. I've tried implementing every optimization I can find. Is this actually possible?
You probably do not have enough memory to train a model this big. With optimizer state, I'd expect 8x 80GB A100s to be the minimum. You can probably do PEFT, though: https://github.com/huggingface/peft
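To make the optimizer-state point concrete, here is a rough back-of-the-envelope sketch. It assumes the common accounting for mixed-precision training with Adam: about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master weights plus the two Adam moments). It also assumes that state is sharded evenly across all GPUs, and it ignores activations, buffers, and fragmentation, so real usage will be higher. The function names are made up for illustration, and the parameter count below is a placeholder, since the thread does not state the model size.

```python
# Rough estimate of weight + gradient + optimizer memory for full fine-tuning.
# Assumption: mixed-precision Adam, ~16 bytes per parameter in total.
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_FP32_ADAM_STATE = 12  # fp32 master weights + momentum + variance
BYTES_PER_PARAM = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_FP32_ADAM_STATE


def training_state_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory (in GB, 1e9 bytes) for weights, gradients, and optimizer state."""
    return num_params * bytes_per_param / 1e9


def max_params_for_cluster(num_gpus: int, gb_per_gpu: float,
                           bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Upper bound on parameter count if state is sharded perfectly across GPUs.

    Activations are not counted, so the practical limit is lower.
    """
    return num_gpus * gb_per_gpu * 1e9 / bytes_per_param


if __name__ == "__main__":
    hypothetical_params = 20e9  # placeholder; substitute the real model size
    print(f"State for {hypothetical_params:.0e} params: "
          f"{training_state_gb(hypothetical_params):.0f} GB")
    for gpus in (4, 8):
        limit = max_params_for_cluster(gpus, 80)
        print(f"{gpus} x 80 GB: at most ~{limit / 1e9:.0f}B params before activations")
```

Under these assumptions, 4x 80 GB leaves no headroom for activations once the model reaches roughly 20B parameters, which is consistent with the OOM errors described above. PEFT methods sidestep most of this because gradients and optimizer state are kept only for a small set of added parameters.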