Low t/s on two 4090s

#3
by hpnyaggerman - opened

Output generated in 208.93 seconds (0.53 tokens/s, 111 tokens, context 125, seed 1431550975)
^^^ This is bluemoonrp-30b with --pre_layer 30 60
Output generated in 9.68 seconds (20.36 tokens/s, 197 tokens, context 125, seed 394076319)
^^^ This is llama-30b-4bit-128g with --pre_layer 30 60

I should note that I am running bluemoonrp-30b with upstream GPTQ, while I am running llama on ooba's fork. Still, I must be doing something badly wrong for the speeds to be this bad; surely the new CUDA branch can't slow things down by a factor of roughly 40. What am I doing wrong?
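
For anyone who wants to check the numbers, here is a rough Python sketch that recomputes tokens/s from the two log lines above and takes the ratio. The helper names (`parse`, `recomputed_tps`, `slowdown`) are made up for this post and are not part of any tool mentioned here; the regex only assumes the log format exactly as quoted.

```python
import re

# The two log lines quoted in this post, copied verbatim.
LOGS = {
    "bluemoonrp-30b": (
        "Output generated in 208.93 seconds "
        "(0.53 tokens/s, 111 tokens, context 125, seed 1431550975)"
    ),
    "llama-30b-4bit-128g": (
        "Output generated in 9.68 seconds "
        "(20.36 tokens/s, 197 tokens, context 125, seed 394076319)"
    ),
}

# Assumption: every line follows the format shown above.
PATTERN = re.compile(
    r"Output generated in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens"
)


def parse(line):
    """Return (seconds, reported_tokens_per_s, tokens) for one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    return float(match.group(1)), float(match.group(2)), int(match.group(3))


def recomputed_tps(line):
    """Tokens per second recomputed from the token count and elapsed time."""
    seconds, _reported, tokens = parse(line)
    return tokens / seconds


def slowdown(fast_line, slow_line):
    """How many times slower the second run is than the first."""
    return recomputed_tps(fast_line) / recomputed_tps(slow_line)


if __name__ == "__main__":
    for name, line in LOGS.items():
        print(f"{name}: {recomputed_tps(line):.2f} tokens/s")
    ratio = slowdown(LOGS["llama-30b-4bit-128g"], LOGS["bluemoonrp-30b"])
    print(f"slowdown: {ratio:.1f}x")
```

The recomputed rates agree with the reported ones to within rounding, and the ratio comes out at about 38x, so "a factor of 40" is a fair round figure. Note that the two runs generated different numbers of tokens (111 vs 197), so this compares throughput, not total time.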
