Interesting paper: GaLore: train 7B models on consumer-grade GPUs 💪
It's now possible to fully pre-train a 7B model on a consumer-grade GPU with 24 GB of memory, without any performance loss!
The memory usage of training models has always been an acute issue: full pre-training of a 7B model used to eat around 50 GB of GPU memory!
The common workarounds to reduce memory load are:
- shard the model across multiple GPUs ("sharding")
- quantize the model: encode weights with fewer bits
Another technique is to project the weight matrix to lower-rank spaces (since the weights often do not really vary along all dimensions): this can save a lot of space!
This low-rank projection can be applied to adapters to preserve the original weights (go check out LoRA), but it still generally hurts performance too much for pre-training.
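To see why rank matters for memory, here is a quick back-of-the-envelope sketch in PyTorch (illustrative sizes, not from the paper): a rank-128 factorization of a 4096x4096 matrix stores 16x fewer values than the full matrix.

```python
import torch

d_out, d_in, r = 4096, 4096, 128   # typical LLM layer size, illustrative rank

# Full weight matrix: d_out * d_in values.
W = torch.randn(d_out, d_in)

# Rank-r factorization (the LoRA/adapter-style idea): delta_W ~= A @ B.
A = torch.randn(d_out, r)
B = torch.randn(r, d_in)

full_values = W.numel()                  # 16,777,216
low_rank_values = A.numel() + B.numel()  # 1,048,576
print(f"full: {full_values:,}  low-rank: {low_rank_values:,}  "
      f"ratio: {full_values / low_rank_values:.0f}x")
```

GaLore applies the same rank argument to the gradients (and hence to the optimizer states) rather than to the weights themselves.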
➡️ Enter the authors of GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. They gather (and prove) interesting insights:
❌ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!
Based on these insights, they build GaLore, which projects the gradient to lower ranks.
🗺️ Great idea: to leave the optimization free to explore more of the space, they periodically re-build the low-rank projection throughout training (there is a nice illustration in the paper).
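Here is a minimal, hypothetical PyTorch sketch of that idea (my own toy version, not the authors' implementation): project each gradient onto its top singular directions, keep the optimizer state in that small space, project the update back, and refresh the projector every `update_gap` steps.

```python
import torch

def galore_style_step(W, grad, state, lr=1e-3, rank=128, update_gap=200, beta=0.9):
    """One illustrative low-rank update for a 2D weight matrix.

    `state` carries the projector P and a momentum buffer kept in rank-r space.
    This is a toy sketch of gradient low-rank projection, NOT the paper's code
    (real GaLore keeps full Adam statistics in the projected space).
    """
    step = state.get("step", 0)

    # Periodically re-derive the projection from the current gradient's top
    # singular vectors, so the optimizer keeps exploring new subspaces.
    if "P" not in state or step % update_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                       # (d_out, rank)
        state["m"] = torch.zeros(rank, grad.shape[1])  # momentum lives in low rank

    P = state["P"]
    g_low = P.T @ grad                                   # project: (rank, d_in)
    state["m"] = beta * state["m"] + (1 - beta) * g_low  # low-rank optimizer state
    W -= lr * (P @ state["m"])                           # project the update back
    state["step"] = step + 1

# Toy usage on a plain tensor (with real nn.Parameters, update param.data
# or wrap the step in torch.no_grad()):
W = torch.randn(4096, 4096)
state = {}
for _ in range(3):
    grad = torch.randn_like(W)   # stand-in for a real backward pass
    galore_style_step(W, grad, state)
```

The memory saving comes from the optimizer state (here just `m`) living in a (rank, d_in) tensor instead of a full (d_out, d_in) one.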
🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8 bits).
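If you want to try that combination in practice, the authors publish a galore-torch package; the snippet below follows my recollection of its README, so treat the import path and the GaLoreAdamW8bit / rank / update_proj_gap / scale names as assumptions to verify against the repo (it also needs bitsandbytes installed).

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW8bit  # assumed import; check the GaLore repo

# Stand-in model: in practice this would be your 7B LLM.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# Low-rank projection is applied to the 2D weight matrices only;
# biases, norms, etc. stay in a regular parameter group.
galore_params = [p for p in model.parameters() if p.ndim == 2]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

optimizer = GaLoreAdamW8bit(
    [
        {"params": regular_params},
        # 'rank', 'update_proj_gap' and 'scale' follow the repo's README;
        # the values here are illustrative, not tuned.
        {"params": galore_params, "rank": 128, "update_proj_gap": 200, "scale": 0.25},
    ],
    lr=1e-3,
)
```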
➡️ Results:
📉 Of course, a huge reduction in memory footprint, allowing training on a consumer-grade GPU (cf. figure).
💪 No reduction in performance: this scales well up to 7B parameters (and has since been independently confirmed). This is essential: it confirms that the method is viable!
Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)