Open BMB's UltraLM 13B GGML

These files are GGML format model files for Open BMB's UltraLM 13B.

Note: I cannot make GGML k-quants for this model due to its vocab size of 32,001. Please see Compatibility below for more detail.

GGML files are for CPU + GPU inference using llama.cpp and the libraries and UIs that support this format.

Repositories available

Prompt template: Vicuna 1.1

USER: prompt


Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

I have produced these files with the 'original' quantisation methods, using an older version of llama.cpp, so that they remain compatible with llama.cpp as of May 19th, commit 2d5db48.

These are guaranteed to be compatible with any UIs, tools and libraries released since late May.

New k-quant methods not compatible with this model at this time

Unfortunately this model has a vocab size of 32,001. This breaks compatibility with the new GGML k-quant method. I cannot make k-quants for this reason.
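
As a rough illustration of why one extra vocab entry matters: k-quants pack weights into super-blocks of 256 values, so a tensor dimension has to be a multiple of 256 to split cleanly. The sketch below assumes that block size (it is not stated in this card), and `fits_k_quant` is a hypothetical helper, not part of llama.cpp.

```python
# Assumption: k-quants group weights into super-blocks of 256 values.
K_QUANT_BLOCK = 256


def fits_k_quant(dim: int, block: int = K_QUANT_BLOCK) -> bool:
    """Return True if a tensor dimension splits evenly into super-blocks."""
    return dim % block == 0


# Compare the standard LLaMA vocab size with this model's vocab size.
for vocab in (32000, 32001):
    print(vocab, fits_k_quant(vocab), "remainder:", vocab % K_QUANT_BLOCK)
```

Under that assumption, 32,000 divides evenly while 32,001 leaves a remainder of 1, which is what rules out k-quants here.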

For further explanation, please see:

Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ---- |
| ultralm-13b.ggmlv3.q4_0.bin | q4_0 | 4 | 7.32 GB | 9.82 GB | Original llama.cpp quant method, 4-bit. |
| ultralm-13b.ggmlv3.q4_1.bin | q4_1 | 4 | 8.14 GB | 10.64 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. |
| ultralm-13b.ggmlv3.q5_0.bin | q5_0 | 5 | 8.95 GB | 11.45 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
| ultralm-13b.ggmlv3.q5_1.bin | q5_1 | 5 | 9.76 GB | 12.26 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
| ultralm-13b.ggmlv3.q8_0.bin | q8_0 | 8 | 13.83 GB | 16.33 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
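
To make the RAM column easier to act on, here is a minimal sketch that picks the largest provided file whose "Max RAM required" figure fits a given amount of free RAM. The figures are copied from the table above and assume no GPU offloading; `largest_fitting_file` is a hypothetical helper name.

```python
from typing import Optional

# (file name, max RAM required in GB), copied from the table, smallest first.
FILES = [
    ("ultralm-13b.ggmlv3.q4_0.bin", 9.82),
    ("ultralm-13b.ggmlv3.q4_1.bin", 10.64),
    ("ultralm-13b.ggmlv3.q5_0.bin", 11.45),
    ("ultralm-13b.ggmlv3.q5_1.bin", 12.26),
    ("ultralm-13b.ggmlv3.q8_0.bin", 16.33),
]


def largest_fitting_file(available_ram_gb: float) -> Optional[str]:
    """Return the largest file that fits in RAM, or None if none fits."""
    best = None
    for name, max_ram_gb in FILES:
        if max_ram_gb <= available_ram_gb:
            best = name  # list is sorted, so later matches are larger
    return best


print(largest_fitting_file(12.0))
```

With GPU offloading the real RAM requirement will be lower than these figures, so treat the result as a conservative choice.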

How to run in llama.cpp

I use the following command line; adjust for your tastes and needs:

./main -t 10 -ngl 32 -m ultralm-13b.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

If you're able to use full GPU offloading, you should use -t 1 to get best performance.

If not able to fully offload to GPU, you should use more cores. Change -t 10 to the number of physical CPU cores you have, or a lower number depending on what gives best performance.

Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins
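
The tuning advice above can be sketched as a small helper that assembles the argument list. It only uses the flags that appear in the command line above; `build_main_args` and its parameters are hypothetical names chosen for illustration.

```python
from typing import List, Optional


def build_main_args(
    model_path: str,
    physical_cores: int,
    gpu_layers: int = 0,
    fully_offloaded: bool = False,
    prompt: Optional[str] = None,
) -> List[str]:
    """Assemble a ./main argument list following the advice in this section."""
    # Full GPU offload: one thread. Otherwise start from the physical core count.
    threads = 1 if fully_offloaded else physical_cores
    args = ["./main", "-t", str(threads)]
    # Only pass -ngl when there are layers to offload to the GPU.
    if gpu_layers > 0:
        args += ["-ngl", str(gpu_layers)]
    args += ["-m", model_path, "--color", "-c", "2048",
             "--temp", "0.7", "--repeat_penalty", "1.1", "-n", "-1"]
    # No prompt means a chat-style session: -i -ins replaces -p <PROMPT>.
    if prompt is None:
        args += ["-i", "-ins"]
    else:
        args += ["-p", prompt]
    return args


print(" ".join(build_main_args("ultralm-13b.ggmlv3.q5_0.bin", 10, gpu_layers=32)))
```

The returned list can be passed straight to `subprocess.run`, which avoids shell quoting problems with the prompt text.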

How to run in text-generation-webui

Further instructions here: text-generation-webui/docs/llama.cpp-models.md.


Original model card: Open BMB's UltraLM 13B


These are the UltraLM-13b delta weights. UltraLM is a chat language model trained on UltraChat.

Model Details

Model Description

The model is fine-tuned from LLaMA-13b with the following multi-turn chat-format template:

User: instruction 1<eos_token>
Assistant: response 1<eos_token>
User: instruction 2<eos_token>
Assistant: response 2<eos_token>

  • License: UltraLM is based on LLaMA and should be used under LLaMA's model license.
  • Finetuned from model: LLaMA-13b
  • Finetuned on data: UltraChat
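
The multi-turn template above can be rendered with a few lines of Python. This is a minimal sketch: `render_chat` is a hypothetical helper, the `</s>` value for `<eos_token>` is an assumption based on LLaMA's tokenizer, and ending the prompt with a bare `Assistant:` for generation is a choice made here rather than something the card specifies.

```python
from typing import List, Tuple

EOS_TOKEN = "</s>"  # Assumption: text form of LLaMA's end-of-sequence token.


def render_chat(
    turns: List[Tuple[str, str]],
    next_instruction: str,
    eos: str = EOS_TOKEN,
) -> str:
    """Render completed (instruction, response) turns plus a new instruction."""
    lines = []
    for instruction, response in turns:
        lines.append(f"User: {instruction}{eos}")
        lines.append(f"Assistant: {response}{eos}")
    # Leave the final assistant turn open so the model writes the response.
    lines.append(f"User: {next_instruction}{eos}")
    lines.append("Assistant:")
    return "\n".join(lines)


print(render_chat([("Hi", "Hello!")], "Tell me about llamas"))
```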

Model Sources


To use this model, you need to recover the full model from the delta weights and perform inference following the template below:

[Optional]User: system prompt<eos_token>
User: user input<eos_token>
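
The recovery step mentioned above is not detailed in this card. As a toy illustration only, the sketch below assumes the delta stores the elementwise difference between the fine-tuned weights and the LLaMA-13b base weights, so recovery is an addition per tensor. `apply_delta` and the list-based weight format are hypothetical; the project's own recovery tooling may differ, for example for tensors resized to hold the extra vocabulary entry.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: tensor name -> flattened values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Recover full weights as base + delta, tensor by tensor."""
    recovered: Weights = {}
    for name, delta_values in delta.items():
        base_values = base[name]
        # Assumption: shapes match. Resized tensors would need extra handling.
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for tensor {name!r}")
        recovered[name] = [b + d for b, d in zip(base_values, delta_values)]
    return recovered


print(apply_delta({"w": [1.0, 2.0]}, {"w": [0.5, -1.0]}))
```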
