Ready for Testing...

I am using 767.44GB of vram right now. @winglian bf16 test shows he is using only 8x80 640GB so not sure what magic he is doing to use so much lower vram. Though I am using trl and he is using axotol.

I have a slight hunch that databricks purposely made the model just big enough to be out of normal 8xA100/H100 range.

Galore may be an option to get it back to normal 8xA100 range but our early tests show galore a lot slower so there may be a trade-off, or we are not testing galore correctly. I haven't personally validated the galore tests.

fahadh4ilyas

Mar 30, 2024

@Qubitium what kind of parallelization engine that you use for using multiple gpu? Deepspeed? or else? And did you do full finetune or lora?

Qubitium

LnL AI org Mar 30, 2024

@fahadh4ilyas Zero parallelization at the moment. Just dumb accelerate/trl integration where model layers spread across multiple gpus but only 1 gpu is particpating in training at any given moment so extremely inefficient. This is our first attempt to train on something that requires more than 1 or 2 gpu for full-finetuning so have not tested out deepspeed yet (it should help).

We only do full-finetuning and not lora/qlora at the moment.

Qubitium

LnL AI org Apr 3, 2024

•

edited Apr 3, 2024

@fahadh4ilyas With optimizer set to paged_adam_8bit memory usage went down to ~670GB of our setup. However, we still reverted back to adam_8bit as the paged_adam_8bit memory pattern was triggering a CUDA/Nvidia issue where UVM process (started by nvidia) which controls gpu/cpu memory sharing is started. This caused training speed to slow down 3x. UVM appears to be unified memory sharing for gpu/cpu that is designed reduce OOM. Not sure how to disable this on linux and paged_adam_8bit triggers this 100% in our setup.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment