great work!

by MaziyarPanahi - opened Apr 12, 2024

Apr 12, 2024

This is great work! Getting this done on a limited number of A100/80G GPUs in only 100 minutes makes it very interesting! Just out of curiosity, did you use accelerate to launch axolotl and load the model on each GPU, or did you launch it with python and shard the model across all GPUs? (I can't find a way to use your config without getting OOM on my 4 A100/80G.)

jonasaise

Jun 15, 2024

I have the exact same question. Have you found a better answer now, a couple of months later?

MaziyarPanahi

Jun 16, 2024

Yes, you can use DeepSpeed (zero2.json) or FSDP. Both make this possible. Keep in mind, though, that the base model supports up to 65k seq length, while this fine-tune maxes out at 2k. Clearly, much more compute is needed if one wants to go higher on seq length.
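For anyone landing here later, this is a rough sketch of what a ZeRO stage 2 config can look like, written out from Python. It is not the exact zero2.json referred to above. The helper names (`build_zero2_config`, `write_config`) and the output filename are made up for illustration, and the `"auto"` values assume the config is consumed through the Hugging Face Trainer integration, which fills them in from the training arguments.

```python
import json
from pathlib import Path


def build_zero2_config():
    """Return a minimal DeepSpeed ZeRO stage 2 config as a dict.

    Stage 2 shards optimizer states and gradients across GPUs,
    while each GPU still holds a full copy of the model weights.
    """
    return {
        "zero_optimization": {
            "stage": 2,
            # Overlap gradient reduction with the backward pass.
            "overlap_comm": True,
            # Copy gradients into a contiguous buffer to reduce fragmentation.
            "contiguous_gradients": True,
        },
        # "auto" is resolved by the Hugging Face Trainer integration
        # (assumption: the trainer in use supports it).
        "bf16": {"enabled": "auto"},
        "gradient_accumulation_steps": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


def write_config(config, path):
    """Write the config dict to `path` as pretty-printed JSON."""
    Path(path).write_text(json.dumps(config, indent=2))


if __name__ == "__main__":
    # Hypothetical output location; point your trainer at this file.
    write_config(build_zero2_config(), "zero2_sketch.json")
```

If stage 2 still runs out of memory, stage 3 (which also shards the weights) or FSDP would be the next thing to try, at the cost of more communication.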

jonasaise

Jun 17, 2024

Thank you for the answer. Have you tried finetuning with full seq length too? Just curious about how much might be needed.

MaziyarPanahi

Jun 17, 2024

Unfortunately, I couldn't find a good SFT dataset with very long, high-quality texts. That's the primary issue when it comes to long-context fine-tuning, and I am pretty sure it will require much more memory.

jonasaise

Jun 18, 2024

Yes, I'm guessing the equivalent of around 8 H100s :-)
