I'm trying to run this using oobabooga but I'm getting 0.17 tokens/second.

#18
by Said2k - opened

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models\llama-13b-4bit-128g.

Modify your start-webui.bat so it launches the server with this line:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128
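For reference, the relevant part of start-webui.bat could end up looking roughly like this (just a sketch; any environment-setup lines already in your file stay as they are, only the server.py call changes):

@echo off
rem ... existing setup lines in start-webui.bat stay unchanged ...
rem replace the original server.py call with one that adds the 4-bit GPTQ flags:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128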

I got this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 7.08 GiB already allocated; 0 bytes free; 7.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 13.07 seconds (0.00 tokens/s, 0 tokens, context 43)

I have 64GB of RAM and 8GB of VRAM.
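The error message itself suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF; a minimal sketch of what that could look like in start-webui.bat, assuming a value of 512 MB (the number is just a starting point to experiment with):

rem reduce CUDA allocator fragmentation, as suggested by the error message
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
call python server.py --auto-devices --chat --wbits 4 --groupsize 128

That only helps with fragmentation, though; if the full model simply doesn't fit in 8GB, it may not be enough on its own.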

Said2k changed discussion title from I'm trying to run this using oobabooga but it's not recognizing it. to I'm trying to run this using oobabooga but I'm getting an OOM Error.

Someone mentioned in oobabooga's repository issues that you also need the "pre_layer" flag so the model doesn't fill the entire GPU, leaving some VRAM free for text generation. The value is the number of layers kept on the GPU: the higher the "pre_layer" number, the faster the model responds, but the more likely it is to run out of VRAM. I set "pre_layer" to 26, which is a bit slow but still manageable. Depending on how long the chat history gets, VRAM can still run out; I've tried tweaking other parameters but haven't had any success so far. Anyway, the line should look like this:

call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --pre_layer 26
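To see how close you are to the VRAM limit while tuning pre_layer, you can watch GPU memory from a second terminal. A small sketch, saved as something like watch-vram.bat (hypothetical name; nvidia-smi ships with the NVIDIA driver):

@echo off
:watch
rem print used vs. total VRAM, then wait two seconds and repeat
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
timeout /t 2 >nul
goto watch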

Note: If anyone reading this is running out of system RAM (not VRAM), try increasing the virtual memory / page file in your OS to over 100GB.

Try again with this line:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 7 --pre_layer 19

This limits VRAM usage; it works on my RTX 3070 with 8GB of VRAM.

I'm getting an average of 0.17 tokens/second; is this normal?

Said2k changed discussion title from I'm trying to run this using oobabooga but I'm getting an OOM Error. to I'm trying to run this using oobabooga but I'm getting 0.17 tokens/second.
