load_in_8bit fine-tuning requires more memory than this notebook

#6
by petermills

I found and was using this example before I found out about load_in_8bit. It worked, and I was able to fine-tune the model on Colab.
After fine-tuning and calling save_pretrained, I realised I was unable to load the fine-tuned model in another notebook with from_pretrained; it turned out there were version issues between PyTorch and Transformers.
I've been trying to use load_in_8bit to fine-tune instead, but it fills the GPU memory and crashes as soon as the training loop starts.
What's the difference between this notebook and load_in_8bit?
Is it LoRA, and how could this be implemented with load_in_8bit?

Thanks

TL;DR

  • load_in_8bit runs the forward pass faster, especially for small batches || this implementation is slower because it needs to de-quantize the weights, while load_in_8bit runs the forward pass with quantized weights
  • load_in_8bit currently requires Turing GPUs or newer (e.g. a Colab T4 or a 2080 are fine; a Colab K80 or a 1080 Ti are not) || this implementation works with any GPU or CPU
  • load_in_8bit currently supports only the forward pass, i.e. no fine-tuning, BUT they are working on a LoRA implementation there and will post an update in a few weeks (a short load_in_8bit usage sketch follows below)
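For context, load_in_8bit is a flag on the transformers loading path, so inference-only use looks roughly like the sketch below (the GPT-J checkpoint name and the prompt are just examples, not something prescribed by this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; swap in whichever model you are working with.
model_name = "EleutherAI/gpt-j-6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit quantizes the weights via bitsandbytes at load time;
# as noted above, this currently supports the forward pass only.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```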

Is it LoRA, and how could this be implemented with load_in_8bit?

Currently, it requires some coding:

  • please install the latest bitsandbytes (i.e. this week's version)
  • write a LoRA wrapper around bnb.nn.Linear8bitLt (a rough sketch follows after this list)
    -- in this wrapper, make sure you pass has_fp16_weights=True and memory_efficient_backward=True (see the example test)
  • use your wrapped layer instead of the standard bnb.nn.Linear8bitLt
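A minimal sketch of such a wrapper is shown below. It is only illustrative: the class name, rank/alpha defaults, and the way the adapter is combined with the base layer are my own choices, not an official bitsandbytes or HF API; the flags follow the suggestion above.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb


class LoRALinear8bit(nn.Module):
    """Sketch: LoRA adapter wrapped around a frozen bnb.nn.Linear8bitLt layer."""

    def __init__(self, in_features, out_features, rank=8, lora_alpha=16, threshold=6.0):
        super().__init__()
        # Frozen 8-bit base layer, with the flags suggested above.
        self.base = bnb.nn.Linear8bitLt(
            in_features,
            out_features,
            has_fp16_weights=True,
            memory_efficient_backward=True,
            threshold=threshold,
        )
        for param in self.base.parameters():
            param.requires_grad = False

        # Trainable low-rank matrices: effective weight is W + (alpha / rank) * B @ A.
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # B starts at zero, so training begins exactly at the base model.
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = lora_alpha / rank

    def forward(self, x):
        # Quantized forward pass plus the low-rank update from the adapters.
        return self.base(x) + F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
```

You would then swap this wrapper into the model in place of the original linear layers, copy the pretrained weights into the base layer, move the model to the GPU so bitsandbytes quantizes it, and train only the lora_A / lora_B parameters.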

Or wait for a couple of weeks till bnb and HF guys do that for you ;)
