How is it different from other 4-bit quants?

#1
by yehiaserag - opened

I was downloading the model from https://huggingface.co/nealchandra/alpaca-13b-hf-int4 and noticed you uploaded a model, but the size is different even though they should be exactly the same.
Do you have any idea why that is?

This is LoRA-trained; the code to convert it is here: https://github.com/jooray/alpaca-lora-13b/ and it uses the weights found here: https://huggingface.co/samwit/alpaca13B-lora

I simply used the conversion script to convert to .pth, then converted back to PyTorch format in order to run GPTQ to quantize and pack the model. There may be some size loss in that process, or it may just be because it's LoRA and not finetuned. Edit: My model's size matches the original. Also, the model you linked is finetuned for one epoch.

Though after testing, I don't believe that LoRA replicates Stanford Alpaca well, and I would finetune it on the actual dataset for the full 3 epochs if I had the resources. I would recommend you use https://huggingface.co/elinas/llama-30b-int4 instead, as it has much better overall coherency with the correct sampler settings, as long as you have a minimum of 15GB of VRAM to run it (24GB total recommended). If you need the sampler values for LLaMA, let me know and I can post them here.

Otherwise, wait for a fully finetuned version of alpaca-13b :)
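For anyone wondering where those sampler settings would plug in: the actual values elinas recommends aren't posted in this thread, so the numbers below are only placeholders that show how such settings are passed to a plain transformers `generate()` call on a full-precision HF checkpoint (the model path is a stand-in; the packed int4 .pt files need a GPTQ loader instead).

```python
# Placeholder sampler values only; these are NOT the settings elinas uses,
# and "path/to/llama-30b-hf" stands in for a full-precision HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-30b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b-hf", torch_dtype=torch.float16, device_map="auto"
)

prompt = "### Instruction:\nExplain what LoRA is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,          # placeholder
    top_p=0.9,                # placeholder
    repetition_penalty=1.15,  # placeholder
    max_new_tokens=256,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```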

elinas changed discussion status to closed
elinas changed discussion status to open

I think I'll have to wait; actually, the difference between the LoRA and the full model is what drove me to get the full model.
Also, I only have 12GB of VRAM here, so it's a wait for me until either a release of the model or good support for offloading to the CPU.
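On the offloading point, here is a rough sketch of what CPU offloading looks like with transformers + accelerate, assuming a full-precision HF checkpoint; the path and memory limits are placeholders, and this route does not apply to the packed GPTQ .pt files.

```python
# Rough sketch of CPU offloading with transformers + accelerate on a ~12GB card;
# the checkpoint path and memory limits are placeholders, and this applies to
# fp16 HF checkpoints rather than the packed GPTQ .pt files.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/alpaca-13b-hf",                  # placeholder path
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate split layers across GPU and CPU
    max_memory={0: "11GiB", "cpu": "32GiB"},  # illustrative limits
    offload_folder="offload",                 # spill to disk if CPU RAM runs out as well
)
```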

> finetune it on the actual dataset for the full 3 epochs if I had the resources

Someone has trained for 3 epochs on alpaca 30b here: https://huggingface.co/baseten/alpaca-30b

(I'm hoping you'll release the alpaca 30b int4 though as I don't have the VRAM to load it in 8-bit mode)

That's LoRA, not finetuned, so it won't replicate the original results as accurately. Regardless, I have converted it to int4 and uploaded it, as it's not bad.

@elinas could you share the sampler values for llama?

It looks like you might be using a LoRA from the uncleaned Alpaca dataset; there is a big difference when using https://github.com/gururise/AlpacaDataCleaned

There's going to be a bigger difference between finetuning the model and using LoRA. This is evident in the quality of Alpaca 7B native vs. Alpaca 7B LoRA. This uses the Stanford dataset like most other Alpaca models on here; that "cleaned" dataset was released a week ago and only has claims behind it. I see no benchmarks showing it is actually better.

There are no benchmarks, but you will notice a big qualitative difference, most notably it no longer says "As a large language model..." after cleaning :)

> There's going to be a bigger difference between finetuning the model and using LoRA.

In the LoRA paper they benchmark it vs. FT. While we can expect FT to be better, they found a very small difference, with LoRA even sometimes being better (I'm assuming that difference is within the uncertainty of the metric).

Since the LoRA/FT gap is small, I would personally expect dataset quality to have a larger impact than LoRA vs. FT, but you're welcome to disagree. It's certainly useful to have your contribution up!
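For context on what LoRA actually changes relative to full finetuning, here is a minimal sketch using the peft library; the rank, alpha, and target modules mirror common alpaca-lora settings but are assumptions here, as is the base model path.

```python
# Minimal sketch of attaching LoRA adapters with the peft library; the rank,
# alpha, and target modules mirror common alpaca-lora settings but are
# assumptions here, as is the base model path.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b-hf")  # placeholder

config = LoraConfig(
    r=8,                                  # low-rank dimension of the adapters
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # only the attention projections get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only a tiny fraction of the 13B parameters is trained
```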

Apologies if I came off as dismissive, but I based this on the 7B finetuned (native) model vs. LoRA, and there was a noticeable difference in quality. Though a model with more parameters trained with LoRA is better than it, specifically Alpaca-30B.

> In the LoRA paper they benchmark it vs. FT. While we can expect FT to be better, they found a very small difference, with LoRA even sometimes being better (I'm assuming that difference is within the uncertainty of the metric).

I have not personally read through the paper, but I find that interesting. Is it the case that as the parameter count gets larger, there is less difference between FT and LoRA?

Anyway, back to the "cleaned" dataset. I do not have the compute to use LoRA for 13B or 30B. If someone goes ahead and releases the LoRA weights, I am willing to quantize them to 4bit. If you find someone who has done so, feel free to let me know.

> Apologies if I came off as dismissive, but I based this on the 7B finetuned (native) model vs. LoRA, and there was a noticeable difference in quality. Though a model with more parameters trained with LoRA is better than it, specifically Alpaca-30B.

Hey, no worries, it's the internet! And huh, that's an interesting observation. LoRA is a bit annoying to support, tbh, especially if you want to use text-generation-webui.

> I do not have the compute to use LoRA for 13B or 30B.

Me neither; it's a sad state of affairs, but at least we are managing to work around it. Hopefully it gets easier for us in the next few years.

> If someone goes ahead and releases the LoRA weights

Chansung has said he plans to (https://huggingface.co/chansung/alpaca-lora-30b/discussions/2#641ef2d8dd15d15f8ed2e0b3) once they have finished cleaning the data (they are still going; for example, 80% of the maths examples were wrong :O).

How did you combine the LoRA and base model btw?

It's a bit of a process, at least originally. I had to use a LoRA merging script, which I modified for 30B since it was already working for 13B. Then I converted the weights to PyTorch/HF (which I can use now to skip the first step), and then I finally quantized the model. The code is on my GitHub if you're curious.

I'm not sure if there's any interest in uploading the PyTorch / HF weights for Alpaca but I might.
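Not the exact script elinas used (that is linked a bit further down), but here is a rough alternative sketch of the merge step using peft's merge_and_unload, with placeholder paths, in case it helps anyone reproduce the pipeline:

```python
# Not the script elinas used; just a rough alternative sketch of the merge step
# with peft's merge_and_unload (paths are placeholders; recent enough
# transformers/peft versions are assumed).
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained("path/to/llama-13b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "samwit/alpaca13B-lora")  # LoRA weights linked earlier
merged = merged.merge_and_unload()              # fold the low-rank deltas into the base weights
merged.save_pretrained("alpaca-13b-merged-hf")  # this folder can then be fed to GPTQ
```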

> The code is on my GitHub if you're curious.

Great, I'll take a look, thanks.

> I'm not sure if there's any interest in uploading the PyTorch / HF weights for Alpaca but I might.

Oh, those are the best and most compatible IMO; I'm sure there is.

Is this your GitHub, btw: https://github.com/elinas? There don't seem to be any public repos.

official-elinas, someone stole my name ; )

Damn squatters!

This is the script? Hmm very nice. https://github.com/official-elinas/alpaca-lora-30b/blob/main/export_state_dict_checkpoint.py

So you had to translate the state dict names and unpermute some weights too. Thanks for sharing; it's much better to do this than deal with the LoRA <> GPTQ <> text-generation-webui incompatibilities.
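For anyone curious, here is a simplified sketch of that step (not the actual export script): HF stores the q/k projection weights permuted for its rotary-embedding layout, so exporting back to Meta's format means renaming keys and undoing that permutation. The dimensions below assume 13B, and the key mapping is abbreviated to the attention projections only.

```python
# Simplified sketch of what such an export script does (13B dimensions assumed:
# 40 heads, hidden size 5120); the key mapping is abbreviated to the attention
# projections only.
import torch

n_heads, dim = 40, 5120  # LLaMA-13B; 30B would be 52 heads, 6656

def unpermute(w: torch.Tensor) -> torch.Tensor:
    # Inverse of the permutation the HF conversion script applies to the q/k
    # projections for its rotary-embedding layout.
    return w.view(n_heads, 2, dim // n_heads // 2, dim).transpose(1, 2).reshape(dim, dim)

def translate_key(hf_key: str) -> str:
    # e.g. "model.layers.0.self_attn.q_proj.weight" -> "layers.0.attention.wq.weight"
    k = hf_key.replace("model.layers.", "layers.")
    k = k.replace("self_attn.q_proj", "attention.wq")
    k = k.replace("self_attn.k_proj", "attention.wk")
    k = k.replace("self_attn.v_proj", "attention.wv")
    k = k.replace("self_attn.o_proj", "attention.wo")
    return k
```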

Edit: Disregard, found my answer above. :)

> It's a bit of a process, at least originally. I had to use a LoRA merging script, which I modified for 30B since it was already working for 13B. Then I converted the weights to PyTorch/HF (which I can use now to skip the first step), and then I finally quantized the model. The code is on my GitHub if you're curious.

> I'm not sure if there's any interest in uploading the PyTorch / HF weights for Alpaca but I might.

I'm sorry, I'm trying very hard to figure out how to 4-bit quantize weights like you have been doing (I have been using your weights, thank you~), but I would like to learn how to do this myself for future releases of LLaMA.

Could you point me in the right direction for this? I am having trouble finding resources.

Edit: nvm, I'm dumb, the code for 4-bit quantization is in the repo:
https://github.com/qwopqwop200/GPTQ-for-LLaMa/
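For completeness, here is roughly what the command-line steps look like; the paths are placeholders, and the flags in GPTQ-for-LLaMa changed frequently around this time, so check the README of the revision you're using.

```bash
# Rough outline only; flags in GPTQ-for-LLaMa changed frequently around this
# time, so check the README of the revision you're using. Paths are placeholders.

# 1. If starting from Meta-style consolidated .pth weights, convert to HF format
#    with the script shipped in transformers.
python convert_llama_weights_to_hf.py \
    --input_dir ./alpaca-13b-pth --model_size 13B --output_dir ./alpaca-13b-hf

# 2. Quantize the HF checkpoint to 4-bit with GPTQ and save the packed weights.
python llama.py ./alpaca-13b-hf c4 --wbits 4 --groupsize 128 --save alpaca-13b-4bit.pt
```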
