VRAM usage

#3
by Juuuuu - opened

Can you guys tell me the VRAM usage of this model? I have a 3080 Ti laptop with 8 GB.
Thanks

8-9 GB of VRAM is required.

I see 8.7-8.9 GB used on my 16 GB laptop 3080 with the model loaded in oobabooga. It goes up to 12.2 GB when it's actually generating text.

8 GB cards can only load it with about 50% of the layers offloaded to the CPU.

@Yuuru How can I try this?

@ghogan42 12.2 GB? I have a desktop 3060, which only has 12 GB. Can I offload a few layers to the CPU so I can run it?

Curious whether a 12 GB 3060 is able to run this model or not.

@cyx123 I have a 12 GB 3060 and it has no problem running any 13B model; they fluctuate between 9 and 11 GB of VRAM usage.

I got RuntimeError: CUDA error: out of memory with an NVIDIA 3070 8 GB.

I could make it work with 8 GB of VRAM, but it's slow:
Output generated in 19.02 seconds (0.63 tokens/s, 12 tokens, context 132)
Output generated in 47.36 seconds (1.03 tokens/s, 49 tokens, context 230)
Output generated in 28.04 seconds (0.96 tokens/s, 27 tokens, context 363)

From https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/discussions/14:
Edit start-webui.bat and replace all the text with:

@echo off

@echo Starting the web UI...

cd /D "%~dp0"

set MAMBA_ROOT_PREFIX=%cd%\installer_files\mamba
set INSTALL_ENV_DIR=%cd%\installer_files\env

if not exist "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" (
  call "%MAMBA_ROOT_PREFIX%\micromamba.exe" shell hook >nul 2>&1
)
call "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" activate "%INSTALL_ENV_DIR%" || ( echo MicroMamba hook not found. && goto end )
cd text-generation-webui

REM --wbits 4 --groupsize 128 matches this model's 4-bit, 128-group-size GPTQ quantization
REM --pre_layer 30 keeps the first 30 layers on the GPU and offloads the rest to the CPU; lower it if you still run out of VRAM
call python server.py --auto-devices --chat --threads 8 --wbits 4 --groupsize 128 --pre_layer 30

:end
pause

Yeah, I'm also getting about 1 token per second with splitting on 8 GB VRAM; the performance is bad. I was able to achieve the same speed using a GGML model + llama.cpp with DRAM.
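For anyone curious about that route, a minimal llama.cpp invocation looks roughly like this (just a sketch: the model filename is a placeholder for whichever 4-bit GGML file you actually downloaded, and -t should match your physical core count):

./main -m models/ggml-vicuna-13b-4bit.bin -t 8 -c 2048 -n 256 -p "Your prompt here"

Since llama.cpp runs from system RAM, VRAM is no longer the limit; generation speed then depends on your CPU instead.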

Is it possible to somehow run it on 6 GB of VRAM? I have a laptop with an RTX 3060.
So far I'm getting a CUDA out of memory message.
