guanaco-65b

#1
by bodaay - opened

First

You are awesome, man. I keep refreshing your repo every hour to see what's new :)

Are you going to release the 65B parameter version?

And I'm wondering about this new QLoRA - since I don't really understand the difference, is it any better or more efficient?

Thanks! Glad you're finding these models useful.

Of course I'm going to release 65B :) It's processing now - will take a couple of hours at least though. And I'll do 7B as well when I'm back - going for dinner now.

Re QLoRA - the key advantage is that it makes it easier/cheaper to train, because it enables training on GPUs with a lot less VRAM.

Basically it's quantised training. You can now train in 4-bit, which means instead of needing loads of big GPUs, like 4 x A100 80GB, you can now train on a single GPU. Possibly the quality/accuracy will be slightly lower (I don't know for sure), but it shouldn't be that noticeable.
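
If you want to see what that looks like in code, here's a rough sketch of a 4-bit QLoRA setup with transformers + peft + bitsandbytes. The base model name, target modules and hyperparameters below are just placeholders, not Guanaco's actual training config:

```python
# Minimal QLoRA-style setup: load the base model in 4-bit and attach LoRA adapters.
# Model name, target modules and hyperparameters are placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the maths in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # example modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```

The 4-bit base weights stay frozen and only the small LoRA adapters get trained, which is where the VRAM saving comes from.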

So expect to see a lot more models coming out!

Thanks for the clarification

@TheBloke just a "thank you" note for doing all the hard work releasing quantized models, often before I can even download the original models to try out. From my testing so far, the original QLoRA version of this model seems to be the best quality and most promising compared to the other LLaMA and non-LLaMA-based models. But the QLoRA inference times were very slow.

Again, "thank you" from an internet stranger for your work.

Edit: just loaded this model and the inference speed is night and day compared to the QLoRA version! We now have a viable inference option (GPTQ) AND fine-tuning option (QLoRA) for consumer-grade GPUs.
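
In case it helps anyone else, GPTQ inference looks roughly like this with AutoGPTQ. The repo name and prompt format below are just examples - check the model card for the exact details:

```python
# Rough sketch of GPTQ inference with AutoGPTQ.
# Repo name and prompt format are illustrative; see the model card for the real ones.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/guanaco-65B-GPTQ"  # example repo name

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
)

prompt = "### Human: Explain QLoRA briefly.\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```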

Thank you. I just wanted to use QLoRA for inference, but it seems there's no need to try it, so I'll continue to use GPTQ. Also, I'm running the guanaco-65B GPTQ model on dual 3090s connected via NVLink, and it doesn't seem to perform as well as 33B.

So 33B almost as good as 65B?