arxiv:2403.03507

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Published on Mar 6

· Submitted by

akhaliq on Mar 7

#1 Paper of the day

Upvote

180

Authors:

Jiawei Zhao ,

Zhenyu Zhang ,

Beidi Chen ,

Anima Anandkumar ,

Yuandong Tian

Abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

View arXiv page View PDF Add to collection

Community

blackhc

Mar 7

https://unrollnow.com/status/1765699202489651364 for a review/TL;DR thread (or for the original: https://twitter.com/BlackHC/status/17656992024896513640)

taufiqdp

Mar 7

https://github.com/jiaweizzhao/galore

mobinx

Mar 7

We need official github code pls and hf integration... What a cool project

Mariastang

Mar 10

I also would like to see as much source code as possible please. Very much appreciated

Legola

Mar 7

This comment has been hidden

librarian-bot

Mar 8

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

leegao19

Mar 8

Some thoughts / ideas, I don't know if they make sense or not:

Instead of r being a hyperparameter, could this be a threshold on the singular value instead? Even something like using random matrix theory to find the spectral threshold for signal-noise?
Instead of T being a hyperparameter, could you measure how "diagonally" the P_t^T G_t Q_t is? I believe the intuition in the paper is that we'd like to regularly "refresh" the principle directions corresponding to the full-rank supports in case they drift over time. As I understand it, the projected gradient is initially just the diagonal of singular values, and it'll drift away from that structure over time (I'm making a big assumption that this drift is gradual and inversely related with how good P,Q still act as the principle directions). It seems like you can quantify that drift somehow and use it to drive whether or not P,Q are still good principle directions for the gradient updates.

leegao19

Mar 9

For Figure 1.

Could you also include what 8bit-Adam + per-layer weight updates but without the rank-reduction on the gradient update would have affected the memory use? It seems like (based on the Lomo paper / https://arxiv.org/abs/2306.09782) that it'd also significantly reduce that light-green part of the memory use since the gradient is consumed+discarded at each layer immediately?

tydsh

Paper author Mar 10

•

edited Mar 10

Thanks for your comments! We have third party evaluation here: https://github.com/jiaweizzhao/GaLore/issues/6. GaLore alone (without per-layer weight update) has comparable memory reduction as per-layer weight update. They are orthogonal techniques. By combining them together you can run 7B pre-training within 24G memory (e.g., 4090).

Minami-su

Mar 11

Very powerful technology.

derek-thomas

Mar 31

•

edited Mar 31

Incredible paper! Im excited to see how this unfolds overtime. Ive become a fan of LoRA's small update footprint especially for serving. But for some use-cases I can see wanting to have more performance.

Id also be curious to see:

Downstream task performance across diverse tasks/metrics
Memory scenarios for common use-cases. How much of a benefit do I get from GaLore vs LoRA or others, or are they all pretty similar.

blanchon

Jun 9

GaLore: Revolutionizing LLM Training with Memory-Efficient Gradient Projections

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

psando

5 days ago

•

edited 5 days ago

Looking for clarification on Figure 1: what is batch size, sequence len, and vocab size here? Because I would expect activations to take up more space...

batch size seems to be 256 based on Fig. 1 caption
sequence len seems to be 2048, based on footnote 1
vocab size is 32000, based on config from repo
bf16 used so 2 bytes per float, based on footnote 2

So only the logits of the model should take up 256 * 2048 * 32000 * 2 bytes or 31.25 GB. Where is this required memory in Figure 1?

Thanks!

Sorenmc

about 1 hour ago

It seems like it would be possible to combine Lora and galore (and their quantized counterparts qlora and qgalore) to further reduce memory footprint by using galore on the gradients for lora matrix A and B. Has anyone tried to experiment with this? I can't find the information in the paper as they mostly view their work as a complete alternative to lora.