Why does Zephyr require less VRAM than Mistral during training?

by webpolis - opened

I have 2 GPUs totaling 18 GB of VRAM. When I train Zephyr with 4-bit quantization, I have enough room, but if I do the same with Mistral, I get an OOM error.

I can't figure out the reason yet.
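For context, a quick back-of-envelope sketch of weight-only VRAM at different precisions. This is a rough estimate under simplifying assumptions: it counts only the base weights and ignores activations, optimizer states, gradients, and any LoRA adapters, which are usually what pushes training over the edge. The parameter count below (~7.24B, typical for Mistral-7B-class models like Zephyr) is an assumed figure, not something from this thread.

```python
def model_vram_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-only VRAM estimate in GiB.

    Ignores activations, gradients, optimizer states, and adapter
    weights, so real training usage is substantially higher.
    """
    return n_params * bytes_per_param / 1024**3

# Assumed parameter count for a 7B-class model (Zephyr is a Mistral fine-tune,
# so both should have essentially the same number of parameters).
n = 7.24e9

print(f"fp16 : {model_vram_gib(n, 2):.1f} GiB")    # 2 bytes/param
print(f"8-bit: {model_vram_gib(n, 1):.1f} GiB")    # 1 byte/param
print(f"4-bit: {model_vram_gib(n, 0.5):.1f} GiB")  # 0.5 bytes/param
```

Since the base weights are the same size for both models, a large VRAM gap between them usually comes from training-time overhead (sequence length, batch size, gradient checkpointing settings) rather than the weights themselves.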

webpolis changed discussion status to closed

Which training method are you using: SFT, LoRA, DPO, RLHF?

I am just curious; this might not answer your question, though.
