Resource Required to train

#2
opened by hiiamsid

I am trying to fine-tune the LongOrca 13B-16K model with LoRA on multiple GPUs, i.e. 3x A100 with 80GB of memory each, but it keeps throwing CUDA out-of-memory errors. Is this normal and it simply needs more memory, or could my code base have a memory leak? If it needs more memory, can you please give me an estimate of the requirements for fine-tuning 13B-16K models?

Hmm, it shouldn't take more than 80GB of memory. Also, to start with, GGUF training is not ideal: it mainly uses the CPU, which makes it extremely slow. I think using something like PEFT is much more efficient.
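
Roughly something like this with PEFT (untested sketch; the base model id, target modules and hyperparameters below are just placeholders to adjust for your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder model id - swap in the LongOrca 13B-16K checkpoint you are actually using
model_id = "Open-Orca/LlongOrca-13B-16k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 instead of fp32 roughly halves the weight memory
    device_map="auto",
)

# Attach small LoRA adapter layers; only these parameters are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```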

@johnwick123forevr I am not using any quantisation like GGUF; I am just taking the fp32 weights and only adding the LoRA layers. But even then, 3x A100 (80GB) is not sufficient for fine-tuning. My main problem is my max_length, which needs to be around 8192; I think that is what is driving the GPU memory consumption. How did you conclude that 80GB should be sufficient?

I don't know how much extra VRAM is required for training at 8192, as I've not done it personally. But if I were going to try, I would definitely want Flash Attention 2 included, as it reduces VRAM consumption.
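
With a recent transformers release and flash-attn installed, it's basically just a flag at load time (sketch only; the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/LlongOrca-13B-16k",            # placeholder - use your base model
    torch_dtype=torch.bfloat16,               # Flash Attention 2 needs fp16/bf16
    attn_implementation="flash_attention_2",
)
```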

I would strongly recommend trying the Axolotl training framework: https://github.com/OpenAccess-AI-Collective/axolotl

It supports:

  • Full fine-tuning (no LoRA)
  • LoRA
  • qLoRA - quantised, so even less VRAM is needed, at slightly lower quality
  • Flash Attention 2, to reduce the VRAM usage from extended context
  • DeepSpeed or FSDP offload, which uses system RAM instead of VRAM; this can be another way to use less VRAM, e.g. DeepSpeed ZeRO-2 or ZeRO-3 (rough config sketch below this list)
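
For the DeepSpeed option above, a ZeRO-3 config with CPU offload can be handed straight to the HF Trainer as a dict, roughly like this (sketch only; batch sizes are left on "auto" so the Trainer fills them in):

```python
from transformers import TrainingArguments

# ZeRO-3 with optimizer/parameter offload to CPU RAM - trades speed for VRAM
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,  # Trainer handles the DeepSpeed initialisation
)
```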

Try LoRA + Flash Attention 2, or LoRA + Flash Attention 2 + DeepSpeed, and I am sure you will do better.

And if that still fails, try qLoRA instead; then 80GB will be more than enough VRAM. Even 48GB is enough for a 70B qLoRA, at least at 4096 context - maybe you'd need the full 80GB for 8192, I'm not sure.
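
For reference, the usual bitsandbytes 4-bit qLoRA setup looks roughly like this (sketch; the model id and LoRA hyperparameters are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantisation of the base weights - this is what makes it "q"LoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/LlongOrca-13B-16k",   # placeholder - use your base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Gradient checkpointing + k-bit prep cut activation memory further at long context
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
               target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
               task_type="CAUSAL_LM"),
)
```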

Yep, as TheBloke said, Axolotl is great. Also, with LoRA training it should easily fit in 80GB of VRAM.

Axolotl should do the exact same thing, and you can train with any dataset as well. You just have to edit a YAML file.
