Some experience with using this model

#2
by Nafnlaus - opened

Using text-generation-webui with the llamacpp loader, these are the parameters I've discovered for the best balance between performance and quality (results from hours of running on random inputs):

**NVidia RTX 3090 + Intel i7-7800X @ 3.50GHz**

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 3 --n_ctx 2048
Tokens/s: 11.40 (Note: I'd advise trying 4 threads; it seems to work better with heavier models)

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_M.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 60 --loader llamacpp --n_batch 2048 --threads 12 --n_ctx 2048
Tokens/s: 6.77

**NVidia RTX 3090 + NVidia RTX 3060 + Intel i7-7800X @ 3.50GHz**

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q4_K_M.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 12 --n_ctx 2048 --tensor_split 47,13
Tokens/s: 9.63

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q4_K_M.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 4 --n_ctx 2048 --tensor_split 47,13
Tokens/s: 10.93

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 2 --n_ctx 2048 --tensor_split 43,17
Tokens/s: 7.65

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 3 --n_ctx 2048 --tensor_split 43,17
Tokens/s: 7.97

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 4 --n_ctx 2048 --tensor_split 43,17
Tokens/s: 8.21

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 5 --n_ctx 2048 --tensor_split 43,17
Tokens/s: 7.82

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 12 --n_ctx 2048 --tensor_split 43,17
Tokens/s: 7.60

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q5_K_M.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 59 --loader llamacpp --n_batch 2048 --threads 12 --n_ctx 2048
Tokens/s: 4.57 (I think it was 12 threads, but I can't be positive)

Just as a general note:
On a single 3090 you should get 20-25 tokens/second for Falcon 40B; anything below 20 means something is going wrong.
Your speed is less than half of where it should be, which indicates something is wrong. I'd do a console client test as a start.
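
(By "console client test" I mean running the model through llama.cpp's main binary directly, something along these lines; the filename is one of yours from above, and the exact flags are just a reasonable starting point, not a tuned command:)

./main -m ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf -ngl 100 -t 4 -c 2048 -b 256 -n 256 -p "Write a short story about a robot."

That takes the webui layer out of the picture, and the eval-time line it prints gives you the raw tokens/s.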

When using llama.cpp (in 2023; this will hopefully improve): do not use multiple GPUs in any scenario where you can avoid it. Falcon 40B fits on a single GPU at moderate quantization, so use that.
Multi-GPU currently comes with extreme performance losses. Use it for Falcon 180B, where you have no choice.

Side notes: using 1000000000 GPU layers is pointless. There are fewer than 100 layers, so you can make it a smaller number for readability.
A batch size of 2048 is too high; that's something you could try on an H100, but not on a consumer GPU. Try 128, 256, or 512 (512 is the default).
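
(For example, taking your first command and just trimming those two values; a sketch, untested on my side:)

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 100 --loader llamacpp --n_batch 512 --threads 3 --n_ctx 2048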

I'd love a command line that would give me 20-25 tokens/sec, if you have one.

Note that I've since switched to the oasst version of this model, so I can only comment on that, but new flags have also come out since I last wrote, and adding --numa --mlock --mul_mat_q boosted my performance to 16-17 tokens/s on a single 3090.
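
(For reference, that just means appending those three flags to the single-3090 command above; shown here with the earlier Q3_K_S filename, since the oasst file is named differently:)

python server.py ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf --listen --listen-port 5678 --verbose --api --xformers --n-gpu-layers 1000000000 --loader llamacpp --n_batch 2048 --threads 3 --n_ctx 2048 --numa --mlock --mul_mat_q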

I could add a note that my 3090 has its power limit cut from 350W to 300W, but my understanding from CUDA benchmarks is that the performance hit from this is quite small relative to the power savings, just a couple of percent, and in my experience the impact seems unnoticeable. And when running on two cards the limit is inter-card bandwidth, not compute, anyway. Though I should probably measure it at some point and verify that I'm not shooting myself in the foot. [ED: my card is currently doing training, but going back up to 350W only boosts performance by 5%, which would equate to less than 1 extra token per second in inference.]
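
(For anyone wanting to reproduce the power cap, on Linux it's a one-liner with nvidia-smi; e.g. to cap GPU 0 at 300W, with -pl 350 to restore it:)

sudo nvidia-smi -i 0 -pl 300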

Re: running on multiple cards, my experience (without NVLink) has been that the hit is usually not that huge IF one of your cards is only running a small number of layers. But the more layers you run on the second card, the more you get bandwidth-bottlenecked. And I can confirm that llama.cpp's load-balancing algorithm across multiple cards is garbage - sometimes it'll do great and sometimes poorly. The fewer layers you have on the second card, the less likely it is to get locked into its poor balancing mode.

Anyway, I'll be upgrading my second card (the 3060) to a 3090 shortly and adding NVLink, so it'll probably be worth a second round of benchmarks at that point.

Start with -b 256
Prepend $env:CUDA_VISIBLE_DEVICES = "0"; (or an export on linux) to ensure it's single GPU
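
On Linux the equivalent is:

export CUDA_VISIBLE_DEVICES=0

or just prefix the launch command itself with CUDA_VISIBLE_DEVICES=0.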

Though given you cut your power target, that might be the reason. Imho that's only useful for mining or 24/7 usage.
I wouldn't put too much hope in NVLink; stay on a single GPU if performance matters to you.

> Start with -b 256

You can see in my command lines above that I'm using n_batch (equivalent to -b) 2048. I've also tried lower batch sizes.

> Prepend $env:CUDA_VISIBLE_DEVICES = "0"; (or an export on linux) to ensure it's single GPU

Obviously I do this on my single-GPU tests :)

> Though given you cut your power target, that might be the reason.

As mentioned, at least with training, it's only a 5% performance penalty in exchange for saving 17% on the GPU's power consumption, and that 5% works out to less than 1 token per second here.

> Imho that's only useful for mining

Not that...

> or 24/7 usage

Bingo. :) I (A) train models, and (B) run inference to generate data to use to train models. Both mean 24/7 GPU usage. The very reason why I use Falcon-40B is because they don't lay any claim in their license to your generations like a lot of models (including Llama) do. I'm training an ultrafast summarization model at the moment, just got started on a dataset for a series of probabilistic state machine models (0-shot or multi-shot analysis of probabilities or relation strengths, returning one or more floating point responses), and in general look to build a suite of fully open models for use in multi-agent systems.

Thus far, the only thing I've seen that we're doing differently (apart from me trying every single command line option under the sun) is that you're running llama.cpp directly, while I'm using the llama.cpp loader from text-generation-webui.

In my tests anything above -b 512 causes slowdowns (on my 4090 too), so I stick to smaller batches.
Though given you do 2k batches, I missed the obvious: you have a very large context. My context typically is not much larger than 2048.
Falcon models suffer a moderate slowdown as context size grows, so if you run with large prompts I would assume that's the main reason.
I did not dig into what is going on in llama.cpp; my guess is an inefficiency in handling the KV cache and broadcasting.
This can get better over time, if Falcon stays relevant.
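
(A quick way to see this in isolation, if you want to rule the webui out: the same console run twice, once with a short prompt and once feeding a long prompt from a file, then compare the eval-time tokens/s lines. long_prompt.txt here is a placeholder for a prompt of, say, 1500+ tokens:)

./main -m ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf -ngl 100 -b 256 -c 2048 -n 128 -p "Hello"
./main -m ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q3_K_S.gguf -ngl 100 -b 256 -c 2048 -n 128 -f long_prompt.txt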

In general Falcon 40B is a very good model; I prefer it over llama-2 (which is also fully permissively licensed).

P.S.
I also considered upgrades, I had a 3070 + 3090. I decided to go for the 4090 and would recommend that.
It's considerably faster than the 3090 and it supports the modern APIs of Nvidia, including FP8, which is much faster than FP16 or FP32. So prompt ingestion speed could benefit if llama.cpp integrates it, which isn't that big a development. The lack of NVLink is there, sadly; that's deprecated on consumer hardware now. Nvidia protecting their endgame.

LLaMA 2 is "relatively" permissively licensed but they ban use of outputs to train anything other than other LLaMA models, so it's viral, which leads to it dominating usage among people training their own models, iteratively. Which is where the other part of the license hits: if any tool ever DOES make it big - no matter how many iterations down the road - Meta retains commercial rights to that. Very clever strategy, IMHO....

I'm not going to use anything to generate outputs where the license imposes restrictions on those outputs. Hence, Falcon 40B :) I want my training datasets to be fully free and not to infect anyone else's projects.

I think I've discovered the difference between our experiences.

When you look at something like this (this is running the related oasst Falcon 40B model's Q3_K_M, the largest quant of it that I can fit on one card, using the --numa --mlock --mul_mat_q flags that, as mentioned, improve performance over my initial results):

llama_print_timings: load time = 522.32 ms
llama_print_timings: sample time = 282.65 ms / 287 runs ( 0.98 ms per token, 1015.40 tokens per second)
llama_print_timings: prompt eval time = 529.75 ms / 107 tokens ( 4.95 ms per token, 201.98 tokens per second)
llama_print_timings: eval time = 14150.93 ms / 286 runs ( 49.48 ms per token, 20.21 tokens per second)
llama_print_timings: total time = 16421.51 ms
127.0.0.1 - - [19/Nov/2023 12:13:14] "POST /api/v1/chat HTTP/1.1" 200 -
Llama.generate: prefix-match hit

Looking at the above, would you read that as "20.21 tokens per second"? And then add 5% performance back for the power cap on the card, and get 21.2 tokens per second, right in your 20-25 tokens/s range?

I don't. I say: 14150.93 ms / 49.48 ms per token = 286 tokens ("286 runs"); 286 tokens / 16,421.51 ms total time = 17.42 tokens/s.

Furthermore, when averaging net performance, I don't just average figures like that "17.42" across runs; I add together the total elapsed time and the total tokens generated, so that quick generations aren't overweighted vs. long generations where most of the time is actually spent.
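
To make that concrete with made-up numbers: if one request generates 40 tokens in 2 s (20 tokens/s) and another generates 300 tokens in 30 s (10 tokens/s), averaging the per-request rates gives 15 tokens/s, but the pooled figure is (40 + 300) / (2 + 30) = 340 / 32 ≈ 10.6 tokens/s, which is what the GPU actually delivered over that time.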

Basically, my interest is in real-world performance, how many tokens you actually get per second of GPU usage. Not just some number that only describes one part of the run.
