Thank you very much!

#2 opened by AiCreatornator

I just want to thank you, because I have been waiting for this type of model, and you are the first one to make these available here. This is the best model I have tried locally thus far. Thank you!

You're welcome! Glad it's working well for you.

Also want to thank you for this!

Is it possible to run it on a GPU with HF?

Yes, it's possible to run it on a GPU.

I've done these repos:
4bit GPTQ quantisation: https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit
Full unquantised HF format: https://huggingface.co/TheBloke/alpaca-lora-65B-HF

The latter would need 128+ GB of VRAM, so that's not likely to be viable for most people. The GPTQ 4bit files should hopefully run in 40GB of VRAM, e.g. 1 x A100 40GB or 2 x 24GB cards like a 3090 or 4090. I haven't actually tested them yet; I'm planning to do so soon. But they should work OK.
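If you do want to try the full HF repo from Python, here's a rough, untested sketch of loading it with Transformers and letting Accelerate spread the weights over your GPUs and CPU RAM (the prompt is just an example):

```python
# Rough sketch (untested): load the unquantised HF repo and let Accelerate
# shard it across all visible GPUs, spilling the remainder to CPU RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/alpaca-lora-65B-HF"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights are ~130GB, so expect heavy CPU offload
    device_map="auto",          # Accelerate decides which layers go on which device
)

prompt = "### Instruction:\nWrite a haiku about llamas.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```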

Here's an explanation of the three different files on the GPTQ repo. I've not had a chance to add this to the README yet:

alpaca-lora-65B-GPTQ-4bit-128g.safetensors :

GPTQ 4bit 128g with --act-order. Should be the highest possible quality quantisation. Will require recent GPTQ-for-LLaMA code; will not work with oobabooga's fork, and therefore won't work with the one-click installers for Windows.

alpaca-lora-65B-GPTQ-4bit-1024g.safetensors :

Same as the above but with a groupsize of 1024. This possibly reduces the quantisation quality slightly, but requires less VRAM. Created with the idea of ensuring this file could load in 40GB of VRAM on an A100 - it's possible the 128g file will need more than 40GB.

alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors:

GPTQ 4bit 128g without --act-order. Possibly slightly lower accuracy. Will work with oobabooga's GPTQ-for-LLaMA fork and the one-click installers.
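If you'd rather load one of these .safetensors files from your own Python script instead of the webui, something along these lines should work with the AutoGPTQ loader - note this is a different code path than the GPTQ-for-LLaMA branches mentioned above, and I haven't tested it, so treat the model_basename and arguments as assumptions:

```python
# Untested sketch: loading the 4-bit GPTQ weights via AutoGPTQ rather than
# GPTQ-for-LLaMA. model_basename must match the file you downloaded (no extension).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/alpaca-lora-65B-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    model_basename="alpaca-lora-65B-GPTQ-4bit-128g.no-act-order",
    use_safetensors=True,
    device="cuda:0",
)

prompt = "### Instruction:\nExplain quantisation in one paragraph.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```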

This probably generates the most ChatGPT 3.5-like responses of any local setup I've tried. Pretty cool. It's slow, even on a computer that's fast by consumer standards, but I'd rather wait than get useless output.

Yep - that is the closest model to GPT 3.5... or even better than GPT 3.5, especially q5_1.
No ultra-tuned 7B or 13B model comes even close to the standard alpaca-lora 65B.

I am testing its story-writing capability... so far it's actually better than GPT 3.5.
Coding seems better as well...

I have a 65GB RAM system (i9-13900K) with a 4090 video card, but I guess I still need to use the CPU version. How do I most easily install this and get it running? The model page says something would have to be compiled? I currently have text-generation-webui set up and it works with smaller models... thanks.

You should be able to use llama.cpp models in text-generation-webui. Check out these docs: https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md
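If you'd rather call llama.cpp from Python directly instead of going through the webui, the llama-cpp-python bindings are a simple option - a rough sketch (the filename is just an example; use whichever quantised .bin you downloaded):

```python
# Minimal sketch using llama-cpp-python to run a GGML quantised file directly.
from llama_cpp import Llama

llm = Llama(
    model_path="./alpaca-lora-65B.ggml.q5_1.bin",  # example filename (assumption)
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads to use
)

out = llm(
    "### Instruction:\nWrite a short story about a robot learning to paint.\n\n### Response:\n",
    max_tokens=256,
    stop=["### Instruction:"],
)
print(out["choices"][0]["text"])
```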

Yep, llama.cpp is great.

How would 65B-HF run on a Mac Studio with M1 Max (10-core CPU, 24-core GPU, 16-core Neural Engine, 32GB unified memory)?
