How many tokens per second?

Opened by Hoioi

Could someone please share the number of tokens per second they get when running this model on CPU and RAM only, without a GPU?


7 t/s on CPU/RAM only (Ryzen 5 3600), 10 t/s with 10 layers offloaded to the GPU, 12 t/s with 15 layers offloaded to the GPU.

7 t/s on CPU/RAM seems pretty good. How much RAM does your computer have? And which interface do you use: text-generation-webui, koboldcpp, or something else?

llama.cpp.
On my RTX 3090, around 40 t/s with the Q4_K_M version (30 layers on GPU).

Thank you for your replies. If anyone else has the statistics, please share with us.

Hi, can you please share the Python code used to access the model? I am struggling to find any.

I'm using llama.cpp (one small binary file) to run the model.
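
If you do want a Python route instead of the raw binary, a minimal sketch using the llama-cpp-python bindings might look like this (the model filename, layer count, and prompt are placeholders, assuming `pip install llama-cpp-python` with GPU support built in):

```python
from llama_cpp import Llama

# Load a local GGUF file; model_path and n_gpu_layers are assumptions,
# adjust them to your own download and VRAM budget.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=30,   # 0 = CPU/RAM only
    n_ctx=4096,        # context window
)

out = llm("Q: How many planets are in the solar system? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

With verbose output left on, the underlying llama.cpp timings (including tokens per second) should be printed after each call.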

On an RTX 4090 & i9-14900K, benchmarked using llama-bench from llama.cpp:

| model | size | params | backend | ngl | threads | pp 512 (t/s) | tg 128 (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 8 | 205.07 | 83.16 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 16 | 204.48 | 83.21 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 24 | 204.28 | 83.22 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 32 | 203.82 | 83.17 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 8 | 145.54 | 27.75 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 16 | 121.58 | 25.57 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 24 | 147.14 | 26.41 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 32 | 145.23 | 9.36 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 8 | 58.18 | 15.12 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 16 | 49.28 | 13.8 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 24 | 64.25 | 15.07 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 32 | 73.69 | 12.02 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 8 | 33.86 | 10.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 16 | 31.75 | 9.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 24 | 40.37 | 10.58 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 32 | 45.39 | 8.8 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 8 | 18.02 | 7.1 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 16 | 19.74 | 5.9 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 24 | 24.81 | 6.74 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 32 | 28.31 | 5.62 |
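
For reference, "pp 512" is prompt-processing throughput over a 512-token prompt and "tg 128" is text-generation throughput over 128 new tokens. If you want a rough measurement of generation speed from Python (not as precise as llama-bench, since it times one completion end to end), a sketch with the llama-cpp-python bindings and a placeholder model path:

```python
import time
from llama_cpp import Llama

# Placeholder path and settings; this lumps prompt processing and generation
# into one timing, whereas llama-bench reports them separately.
llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", n_gpu_layers=27, n_ctx=2048)

start = time.perf_counter()
out = llm("Write a short story about a robot learning to paint.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} t/s")
```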

Hello, on my machine (Ryzen 7 1700; 40 GB RAM; RTX 4090 with 24 GB VRAM) I got a mere 0.96 tokens/s. Oobabooga, mixtral-8x7b-v0.1.Q5_K_M.gguf, model loader = llama.cpp, n-gpu-layers=30, n_ctx=16384.
I tried the 6-bit model, but it does not run (CUDA error, out of memory).
I'll try other configurations; I'll post any improvements here later.

That is far worse than I get with an old Xeon CPU only, and I'm using the Q6 model (the latest ooba isn't using my GPU at all; I need to look into it later this week to see what is up).

That seems far lower than it should be. I'm almost sure something is wrong.

That is extremely slow... I have a Ryzen 7950X3D and an RTX 3090 and get 30+ tokens/s with Q4_K_M and 10+ tokens/s with Q5 (fewer layers on GPU).

Too many layers on the GPU, especially with Q5. Try 18 to 20 layers instead.
With 30 GPU layers, the excess will likely spill into shared video memory (system RAM), which is not advisable at all, since the GPU then has to work from that much slower memory.
Also, don't forget that the first response after the model has been loaded into memory can take much longer...
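
If you drive llama.cpp from Python rather than the Oobabooga UI, the same advice maps onto the loader arguments; a sketch, where the exact numbers are starting points to tune rather than known-good values:

```python
from llama_cpp import Llama

# Fewer offloaded layers keep everything in dedicated VRAM instead of
# spilling into shared memory; a smaller n_ctx also shrinks the KV cache.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=18,   # try 18-20 for Q5 on a 24 GB card, per the advice above
    n_ctx=8192,        # instead of 16384
)
```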
