Any chance of a 13B-20B version?

#7
by smcleod - opened

Is there any chance there will be a slightly smaller version, somewhere between 13B and ~20B, that's likely to run on more common GPUs with 16GB of VRAM?

A lot of the decent coding models coming out seem to be focused on folks with 24GB+ cards.

How do you know that it's only for a 24 GB card? So if we use the code they provided to show how to use it, do we have to have a certain set of specs?

Because a 34B model won't fit on a 16GB GPU. Quantised at 4-bit, however, it should just fit on a 24GB GPU.

Thanks for the response. Is there anything I can read that will help me understand the math better? In other words, how do you know what fits and what does not? I appreciate any information you can pass along. I am assuming that when I build my next PC, I need to get a GPU that will be able to handle these models, like an RTX 4090?
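As a rough rule of thumb: weight memory in GB ≈ parameters (in billions) × bits per weight ÷ 8, plus a couple of GB for the KV cache and runtime overhead. A minimal sketch of that arithmetic (the overhead and bits-per-weight figures below are approximations, not exact specs):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    """Very rough VRAM estimate: quantised weights plus a fudge factor
    for the KV cache and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(34, 16))   # fp16: ~70 GB, far beyond any single consumer GPU
print(estimate_vram_gb(34, 4.8))  # Q4_K_M averages ~4.8 bits/weight: ~22 GB, fits 24 GB but not 16 GB
```

That's why 34B at 4-bit targets 24GB cards like the 3090/4090, while a 13B model at 4-bit (~10 GB all in) sits comfortably on a 16GB card.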

@corey4005 - so I was able to get the v2 (phind-codellama-34b-v2.Q4_K_M.gguf) of this model running on my little Tesla P100 (16GB), but it's very slow (2.5-3 tokens/s).

Output generated in 40.89 seconds (2.69 tokens/s, 110 tokens, context 454, seed 403749230)
MEM[|||||||||||||||||15.560Gi/16.000Gi]

V2 GGUF - https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/blob/main/phind-codellama-34b-v2.Q4_K_M.gguf

Settings:

  • llamacpp_hf
  • gpu layers 33
  • tokens 1024
  • batch 512
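
For reference, the rough equivalent with the standalone llama.cpp CLI (an assumption on my part; the run above used the llamacpp_hf loader in text-generation-webui) would be:

```
# -ngl: layers offloaded to the GPU, -n: max new tokens, -b: batch size
./main -m phind-codellama-34b-v2.Q4_K_M.gguf -ngl 33 -n 1024 -b 512
```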

Install it with cuBLAS. It massively speeds up GPU inference.
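
For anyone following along, a sketch of what that looks like, assuming llama-cpp-python (the build flag has changed names across versions; LLAMA_CUBLAS was current at the time):

```
# llama-cpp-python built with cuBLAS support
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# or, for a standalone llama.cpp build
make LLAMA_CUBLAS=1
```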

Does that let you split between CPU and GPU memory though, @johnwick123forevr?

smcleod changed discussion status to closed

Yes, you still split between CPU and GPU memory. Higher GPU layers = more of the model in GPU memory.
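
In llama-cpp-python terms (a sketch; the layer count and batch size mirror the settings posted above), the split is controlled by n_gpu_layers:

```python
from llama_cpp import Llama

# Offload 33 of the model's layers to the GPU; the rest stay in CPU RAM.
llm = Llama(
    model_path="phind-codellama-34b-v2.Q4_K_M.gguf",
    n_gpu_layers=33,
    n_batch=512,
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```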
