---
license: other
inference: false
---

# Quantised GGMLs of alpaca-lora-65B

Quantised 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).

I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).

## REQUIRES LATEST LLAMA.CPP (May 12th 2023 - commit b9fd7ee)!

llama.cpp recently made a breaking change to its quantisation methods.

I have re-quantised the GGML files in this repo. You will therefore need llama.cpp compiled on May 12th or later (commit `b9fd7ee` or later) to use them.

The previous files, which will still work in older versions of llama.cpp, can be found in branch `previous_llama`.

## Provided files

| Name | Quant method | Bits | Size | RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 40.8GB | 43GB | 4bit. |
| `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 44.9GB | 47GB | 5bit. Higher quality than 4bit, at cost of slightly higher resources. |
| `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 49GB | 51GB | 5bit. Slightly higher resource usage and quality than q5_0. |

* The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
* The q5_0 file uses the new 5bit method released on 26th April. This is the 5bit equivalent of q4_0.
* The q5_1 file uses the new 5bit method released on 26th April. This is the 5bit equivalent of q4_1.

## How to run in `llama.cpp`

I use the following command line; adjust for your tastes and needs:

```
./main -t 18 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:"
```

Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.

If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`.

## How to run in `text-generation-webui`

Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).

Note: at this time text-generation-webui does not support the new q5 quantisation methods.

**Thireus** has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.

# Original model card not provided

No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).

Based on the name, I assume this is the result of fine tuning using the original GPT 3.5 Alpaca dataset. It is unknown whether the original Stanford data was used, or the [cleaned tloen/alpaca-lora variant](https://github.com/tloen/alpaca-lora).
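
# Example: choosing a thread count for llama.cpp

The advice in the `llama.cpp` section above is to set `-t` to the number of physical CPU cores. A minimal sketch of one way to do that follows. It assumes a Linux system where `lscpu` (from util-linux) is available and falls back to `nproc`, which counts logical CPUs and may overcount on systems with hyper-threading. The `physical_cores` helper and the `threads`, `model`, and `cmd` variables are illustrative names, not part of llama.cpp. The script only prints the chat-mode command so you can check it before running it.

```shell
#!/bin/sh
# Sketch: derive -t from the physical core count and print the command.
# Assumption: Linux with lscpu; otherwise fall back to nproc, then to 1.

physical_cores() {
  n=0
  if command -v lscpu >/dev/null 2>&1; then
    # Each unique (core, socket) pair in lscpu's parseable output
    # corresponds to one physical core; comment lines start with '#'.
    n=$(lscpu -p=Core,Socket 2>/dev/null \
        | awk '!/^#/ && !seen[$0]++ { c++ } END { print c + 0 }') || n=0
  fi
  if [ -z "$n" ] || [ "$n" -lt 1 ]; then
    # nproc reports logical CPUs, so this may be higher than the core count.
    n=$(nproc 2>/dev/null || echo 1)
  fi
  echo "$n"
}

threads=$(physical_cores)
model="alpaca-lora-65B.ggml.q4_0.bin"

# Same flags as the command in the llama.cpp section, with the -p prompt
# replaced by -i -ins for a chat-style session.
cmd="./main -t $threads -m $model --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins"

# Print instead of executing, so the command can be reviewed first.
echo "$cmd"
```

Once the printed command looks right, paste it into a shell in your llama.cpp directory. Swap the `model` value for one of the q5 files if you have the extra RAM listed in the table.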