
Great model, very powerful, but output never ends

#8
by bkieser - opened

I have been using the int4 quant on 2 x AMD 7900XTX. Works well, but the output never ends. As the output continues, it starts to break down into internal tokens, training data extracts, etc.

Sorry to bother you, but what do you use? LMStudio? Koboldcpp? Oobabooga? And what is your generation speed in tokens per second? Is it a nightmare to work with ROCm?
I am curious because I plan on buying an AMD GPU.

It's my pleasure.

I have 2 x Radeon RX 7900 XTX (2 x 24 GB) and I use llama.cpp running in server mode, because some of the inference needs to be offloaded to RAM.
For Smaug, the best setting is -ngl 70 (70 layers offloaded onto the two 7900 cards) with a 4096 context (-c 4096).

Here is an example of the command line:
llama.cpp/server -m /home/tmp/Smaug-Llama-3-70B-Instruct.i1-Q4_K_M.gguf -n 4096 -c 4096 -ngl 70 --host 0.0.0.0 --port 2600 -a smaug

You can see that I am running it in server mode while in dev; it will be productionised when it moves internally for the staff to use.
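
For completeness, here is a minimal sketch of querying that server from Python. It assumes a llama.cpp server build recent enough to expose the OpenAI-compatible /v1/chat/completions endpoint; the prompt and the max_tokens cap are just placeholders, while the host, port and alias match the command line above.

import json
import urllib.request

# Minimal sketch: query the llama.cpp server started above.
# Assumes the OpenAI-compatible /v1/chat/completions endpoint is available;
# --port 2600 and -a smaug come from the command line shown earlier.
payload = {
    "model": "smaug",  # the -a alias
    "messages": [{"role": "user", "content": "Summarise ROCm support in one sentence."}],
    "max_tokens": 256,  # cap the generation length
}
req = urllib.request.Request(
    "http://localhost:2600/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])

Anything that can speak HTTP behind the firewall can use the model the same way.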

Regarding AMD:

  1. Works very well
  2. Not as fast as NVIDIA, but a lot cheaper!
  3. Memory is what matters, not GPU speed. You need that precious VRAM to have a useful context length and enough parameters for a genuinely useful LLM to run. Memory is more important than anything else (there is a rough VRAM estimate just after this list).
  4. I have a Ryzen 7 7700X CPU. This is a bottleneck. A Threadripper would be much better, especially if you quantise to int8 (q8), which is so much more efficient for both CPU and GPU.
  5. The CPU lifts a load in the layers that can't be offloaded onto the GPUs, and AMD support for shared memory is currently really bad, so the CPU works hard shovelling data around. NVIDIA is massively more efficient and CUDA is very clearly better than ROCm. But... ROCm works. Maybe not as fast, and maybe the CPU works really hard too, but you do get the results.
  6. This is very important: the AMD cards ARE SMALLER than the massive Nvidia RTX 4090 I have in the other server. This matters because the NVIDIA card uses up all the physical slot space on a standard PC motherboard; you can only fit one unless you go to a specialised board, and then the price suddenly jumps hugely, 300% or more. The AMD card uses only two slots' worth of space, which means that on most standard PC motherboards you can fit two cards (but not three, which again requires a specialised motherboard). This card-size issue matters because the GPU itself is only part of the cost.
  7. When you have 2 x cards, you can also load in two smaller LLMs that fit completely on each card and don't have to share data via the CPU, and that is massively faster because it's all on the GPUs. There is a very real advantage to being able to run two LLMs in parallel, even if the actual GPUs are slower than the mighty Nvidia 4090 (see the parallel-query sketch further down).
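
To put some rough numbers on point 3, here is a back-of-envelope VRAM estimate for this exact setup. The ~4.8 bits/weight figure for Q4_K_M and the Llama 3 70B shape constants are approximations, not measured values.

# Back-of-envelope VRAM estimate for a 70B Q4_K_M model at 4096 context.
# All numbers are approximate; compute buffers and per-GPU overhead come on top.
params = 70.6e9              # Llama 3 70B parameter count (approx.)
bits_per_weight = 4.8        # Q4_K_M average bits per weight (approx.)
layers, kv_heads, head_dim = 80, 8, 128  # Llama 3 70B architecture
context = 4096

weights_gb = params * bits_per_weight / 8 / 1e9
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # K+V in fp16, bytes
kv_cache_gb = context * kv_per_token / 1e9

print(f"weights ~{weights_gb:.0f} GB + KV cache ~{kv_cache_gb:.1f} GB "
      f"~= {weights_gb + kv_cache_gb:.0f} GB vs 48 GB across two 7900 XTX")

With compute buffers and per-GPU overhead on top of that, this is roughly why a few of the 80 layers still stay on the CPU at -ngl 70.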

Speed: I am getting around 17 t/s, which isn't great, sure, but it's running locally on a powerful LLM, with no data being shared, and it can access internal tools behind a firewall, so there's a big advantage there. Also, I use crewai, autogen, etc., and I can delegate to smaller models which run entirely on the GPU(s) if needed, and that is then many times faster. It's a matter of carefully choosing which LLM you actually need for that bit of the task.
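
As a concrete illustration of point 7 and the delegation idea above, this is roughly what the client side of "two smaller LLMs in parallel" can look like. The ports and the prompt are hypothetical; each llama.cpp server instance would be started separately, pinned to its own GPU (e.g. by restricting GPU visibility per process).

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Two llama.cpp server instances, one per GPU (hypothetical ports),
# queried concurrently so neither model waits on the other.
ENDPOINTS = {
    "gpu0": "http://localhost:2601/v1/chat/completions",
    "gpu1": "http://localhost:2602/v1/chat/completions",
}

def ask(url, prompt):
    payload = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 128}
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=2) as pool:
    jobs = {name: pool.submit(ask, url, "Classify this support ticket: VPN is down.")
            for name, url in ENDPOINTS.items()}
    for name, job in jobs.items():
        print(name, "->", job.result())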

Hope this helps.

I forgot to add: if it runs in memory entirely on the GPUs (i.e. no CPU offloading), then I'm getting 122 t/s.

Thank you very much for your detailed answer! And when you say that you get 122 t/s, is that for a 70B model? I expected the generation to slow down due to the communication between the two GPUs.
