Suggested Settings for loading/using e.g. OobaBooga
Thanks for creating and releasing this model. A lot of people want to use it, but which settings matter most for making it run well on the consumer hardware most people have?
For example:
- Loader - Transformers? ExLlama? llama.cpp?
- GPU/CPU memory allocations?
- Chat parameters - e.g. max new tokens, etc.
Maybe you could provide some rough, ballpark suggestions for low-end, mid-range, and high-end systems.
https://github.com/oobabooga/text-generation-webui/tree/main/docs
I haven't personally used oobabooga, but generally you would want to use GPTQ or GGML for fast inference and lower VRAM requirements at home.
It would require ~12GB of VRAM; if you don't have that, you will need ~12GB of system RAM instead. GGML supports CPU inference, while GPTQ/ExLlama does not.
It supports a context size of up to 4096 tokens, but using less will keep your VRAM usage and performance in check.
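Not an official recommendation, but as a rough sketch of the GGML route: llama-cpp-python exposes the two knobs that matter most here, `n_gpu_layers` (how many layers to offload to VRAM) and `n_ctx` (context size). The model path and layer count below are placeholders; adjust them to your own file and GPU.

```python
# Minimal llama-cpp-python sketch for GGML inference with partial GPU offload.
# The model path and n_gpu_layers value are assumptions -- tune them to your system.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.ggmlv3.q4_K_M.bin",  # hypothetical quantized GGML file
    n_ctx=2048,        # context window; up to 4096 is supported, smaller saves memory
    n_gpu_layers=35,   # layers offloaded to the GPU; 0 = pure CPU, raise it if you have spare VRAM
)

out = llm("Write a haiku about quantization.", max_tokens=128)
print(out["choices"][0]["text"])
```

In the webui you would pick the llama.cpp loader and set the equivalent sliders (GPU layers, context length) in the Model tab; the tradeoff is the same: more offloaded layers and a larger context mean more VRAM used.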