Why is generating a response so slow?

#11
by daniq - opened

I'm using hosted VM with these specs:

  • NVIDIA A40 with 46GB VRAM
  • AMD EPYC 7252 CPU
  • 52 GB RAM

I've loaded this model into the oobabooga text UI. Each response takes far too long to generate; for example, generating 351 tokens took about 435 seconds. I'm using the default configuration of the UI.

Why is that? Are these specs not enough, or do I need to tweak some configs?
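For reference, 351 tokens in 435 seconds works out to roughly 0.8 tokens/s, which is far below what an A40 should manage and usually means the model isn't actually running on the GPU. A rough sanity check (the exact env path is an assumption based on the one-click installer layout):

# Watch GPU utilisation and VRAM while a generation is running;
# ~0% GPU utilisation during generation suggests the model is on the CPU.
watch -n 1 nvidia-smi

# Check whether the webui's bundled Python environment can see the GPU at all
# (path assumes the one-click installer's env; adjust for your install)
"<installer_path>/installer_files/env/bin/python" -c "import torch; print(torch.cuda.is_available())"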

I've been using the ooba one-click installer for my text UI. I asked some people on Discord and they gave me a solution.

The solution was to create a symlink like this:

ln -s "<installer_path>/installer_files/env/lib" "<installer_path>/installer_files/env/lib64"
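If I understand it correctly, this helps because the env ships its libraries under env/lib while something in the stack presumably looks for them under env/lib64, so the symlink makes both paths resolve. You can confirm it was created with:

# Confirm the symlink exists and points back at the env's lib directory
ls -l "<installer_path>/installer_files/env/lib64"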

Someone has already made a pull request that applies this fix: https://github.com/oobabooga/one-click-installers/pull/84

I hope they will accept it.

daniq changed discussion status to closed

It's strange that that was required.

By the way, if you want to use cloud GPUs for text-generation-webui, I have a Runpod template with text-generation-webui + AutoGPTQ + ExLlama (for 2x faster inference on Llama GPTQ) + llama-cpp-python for GPU-accelerated GGMLs.

It's all set up and ready to go and very easy to use. The templates are:

Or if you want to use the Docker config as a base for your own, then the source is here: https://github.com/TheBlokeAI/dockerLLM
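A rough sketch of building it locally as a base for your own image (this assumes a Dockerfile at the repo root; adjust for the repo's actual layout):

# Clone the repo and build the image locally, then extend it as needed
# (the image tag "my-textgen-webui" is just an example name)
git clone https://github.com/TheBlokeAI/dockerLLM
cd dockerLLM
docker build -t my-textgen-webui .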

Thanks for the templates. However, I'm using Vast.ai, which is cheaper for me, but I'll still look into them; maybe it's possible to use them there too.