Why is generating a response so slow?

#11
by daniq - opened

I'm using hosted VM with these specs:

  • NVIDIA A40 with 46GB VRAM
  • AMD EPYC 7252 CPU
  • 52 GB RAM

I've loaded this model into the oobabooga text UI. Each response takes far too long to generate; for example, generating 351 tokens took about 435 seconds. I'm using the default configuration of the UI.

Why is that? Are these specs not enough, or do I need to tweak some configs?
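For reference, 351 tokens in 435 seconds works out to roughly 0.8 tokens/s, which is far below what an A40 should manage and usually means the model isn't actually running on the GPU. A rough sanity check (the exact env path is an assumption based on the one-click installer layout):

# Watch GPU utilisation and VRAM while a generation is running;
# ~0% GPU utilisation during generation suggests the model is on the CPU.
watch -n 1 nvidia-smi

# Check whether the webui's bundled Python environment can see the GPU at all
# (path assumes the one-click installer's env; adjust for your install)
"<installer_path>/installer_files/env/bin/python" -c "import torch; print(torch.cuda.is_available())"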

I've been using the ooba one-click installer for my text UI. I asked some people on Discord and they gave me a solution.

The solution was to create a symlink like this:

ln -s "<installer_path>/installer_files/env/lib" "<installer_path>/installer_files/env/lib64"
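If I understand it correctly, this helps because the env ships its libraries under env/lib while something in the stack presumably looks for them under env/lib64, so the symlink makes both paths resolve. You can confirm it was created with:

# Confirm the symlink exists and points back at the env's lib directory
ls -l "<installer_path>/installer_files/env/lib64"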

Someone has already made a pull request that applies this fix: https://github.com/oobabooga/one-click-installers/pull/84

I hope they will accept it.

daniq changed discussion status to closed

It's strange that that was required.

By the way, if you want to use cloud GPUs for text-generation-webui, I have a Runpod template with text-generation-webui + AutoGPTQ + ExLlama (for 2x faster inference on Llama GPTQ) + llama-cpp-python for GPU-accelerated GGMLs.

It's all set up and ready to go and very easy to use. The templates are:

Or if you want to use the Docker config as a base for your own, then the source is here: https://github.com/TheBlokeAI/dockerLLM
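A rough sketch of building it locally as a base for your own image (this assumes a Dockerfile at the repo root; adjust for the repo's actual layout):

# Clone the repo and build the image locally, then extend it as needed
# (the image tag "my-textgen-webui" is just an example name)
git clone https://github.com/TheBlokeAI/dockerLLM
cd dockerLLM
docker build -t my-textgen-webui .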

Thanks for the templates. However, I'm using Vast.ai, which is cheaper for me, but I'll still look into them; maybe it's possible to use them there too.