Custom 4-bit finetuning: 5-7 times faster inference than QLoRA

#25
by rmihaylov - opened

Excuse me, some questions for you:

  1. What is the difference between your falcontune and QLoRA?
  2. What is the difference between fine-tuning (with a new dataset) using bitsandbytes + PEFT and using your code? Or is your script simply a simplified form of bitsandbytes + PEFT?
  3. Can I activate 'nf4' (normal float 4-bit) in GPTQ?

Excuse me, some questions for you:

I join in the questions!

Doesn't 40B require something like 48 GB of VRAM? Also, if anyone reads this, I would be very appreciative of any insight into cost-efficient/realistic hardware for ML. It seems like the cheapest build is somewhere in the neighborhood of $5-6k, and I think I would rather have my own hardware than rely on Amazon/Google/Azure. Thanks.

Falcon 40B inference in 8-bit takes 45 GB of memory. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 machine, it takes around 4 seconds to generate 20 tokens (words); in 4-bit it takes 25 GB of memory but 12 seconds for the same 20 tokens - not sure why..

...
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit load; note the bnb_4bit_* fields below only take effect when load_in_4bit=True
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    PATH,                        # path or hub id of the Falcon checkpoint
    device_map="auto",           # place layers across available devices automatically
    trust_remote_code=True,      # Falcon's modelling code is loaded from the repo
    quantization_config=bnb_config,
)
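
For comparison, the 4-bit run mentioned above would use load_in_4bit=True instead of load_in_8bit=True. A minimal sketch of that variant (assuming the same PATH placeholder; the nf4/bfloat16 choices mirror the config above and are not confirmed to be exactly what was run):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 load, the configuration the 8-bit timings are being compared against
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    PATH,                        # same checkpoint placeholder as above
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config_4bit,
)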

Can anyone help me, please?
I have text data stored in a .txt file; it is simple information about a technology.
I want to fine-tune the Falcon model on it and then ask the model questions about the content of that .txt file.
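
Not an answer from this thread, but one common starting point is to turn the .txt file into a Hugging Face dataset and train a LoRA adapter on top of the quantized model. A hedged sketch only; the model id, file path, and hyperparameters below are placeholders, not values from this discussion:

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "tiiuae/falcon-7b"            # placeholder; swap for the checkpoint you use
TXT_FILE = "my_technology_notes.txt"  # placeholder path to your .txt data

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit so it fits on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; "query_key_value" targets Falcon's attention projection
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query_key_value"], task_type="CAUSAL_LM",
))

# Each non-empty line of the .txt becomes one training example; tokenize for causal LM
dataset = load_dataset("text", data_files=TXT_FILE)["train"]
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="falcon-lora-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

After training, you would generate answers by prompting the adapted model with your question; for retrieval-style Q&A over a document, people often pair (or replace) fine-tuning with a retrieval step, but that is a separate design choice.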

Falcon 40B inference in 8-bit takes 45 GB of memory. On a single RTX A6000 48 GB (not the Ada version) in an AMD EPYC 7713 DDR4 machine, it takes around 4 seconds to generate 20 tokens (words); in 4-bit it takes 25 GB of memory but 12 seconds for the same 20 tokens - not sure why..

I would also love to know why it takes so long.

My main reason (and I suspect many people's main use case) for GPT alternatives includes both open source AND, hopefully, faster speed. Reducing the memory profile but increasing the latency seems like a lateral move.
