Is there a 4-bit quantized version for FastChat?

#2
by ruradium - opened

I tried TheBloke/vicuna-33B-preview-GPTQ, but I get an error at the loading stage:

Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.g_idx" etc.

FastChat uses an older and, in my opinion, no-longer-recommended GPTQ implementation called GPTQ-for-LLaMa, which requires setting the GPTQ parameters manually. If it used AutoGPTQ instead, the parameters would be auto-detected and it would work out of the box.
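For comparison, here is a minimal sketch of loading this model with AutoGPTQ instead. It assumes a quantize_config.json ships in the repo (that file is what lets AutoGPTQ auto-detect bits and group size) and that the weights are stored as safetensors:

    from auto_gptq import AutoGPTQForCausalLM
    from transformers import AutoTokenizer

    repo = "TheBloke/vicuna-33B-preview-GPTQ"

    tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)

    # bits and group_size are read from quantize_config.json in the repo,
    # so no manual equivalents of --gptq-wbits / --gptq-groupsize are needed
    model = AutoGPTQForCausalLM.from_quantized(
        repo,
        use_safetensors=True,  # assumption: the repo stores safetensors weights
        device="cuda:0",
    )

    prompt = "USER: write a story about llamas\nASSISTANT:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))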

As I said to you on your post in my repo, you need to set group_size to None or -1.
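With FastChat's GPTQ-for-LLaMa loader, that means either omitting the group-size flag entirely or passing -1 explicitly, e.g. (using the same model path as in the log further down):

    python3 -m fastchat.serve.cli \
        --model-path /workspace/process/vicuna-33b/gptq \
        --gptq-wbits 4 \
        --gptq-groupsize -1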

Tried that, but I still get the same error.

OK, I just tested it and yes, I see the same thing. It's because recent GPTQ-for-LLaMa branches have broken compatibility again. This is why I don't recommend using it.

However you can get it to work if you use the old-cuda branch of GPTQ-for-LLaMa.

To do that:

cd FastChat/repositories/GPTQ-for-LLaMa
pip3 uninstall -y quant-cuda
git switch old-cuda
python3 setup_cuda.py install
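If you want to confirm the rebuilt extension landed in the active environment (just a sanity check, assuming you're still in the same venv), try importing it:

    python3 -c "import quant_cuda; print(quant_cuda.__file__)"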

Then test again, without the --gptq-groupsize parameter:

 python3 -m fastchat.serve.cli \
    --model-path /workspace/process/vicuna-33b/gptq \
    --gptq-wbits 4

Here is a log of me making that change and successfully using FastChat with my Vicuna 33B Preview GPTQ:

 [fastchat] ubuntu@h100:/workspace/git/FastChat (main ✘)✭ ᐅ cd repositories/GPTQ-for-LLaMa
 [fastchat] ubuntu@h100:/workspace/git/FastChat/repositories/GPTQ-for-LLaMa (fastest-inference-4bit ✘)✭ ᐅ ll
total 132K
-rw-rw-r-- 1 ubuntu ubuntu  12K Jun 23 12:33 LICENSE.txt
-rw-rw-r-- 1 ubuntu ubuntu 7.9K Jun 23 12:33 README.md
drwxrwxr-x 2 ubuntu ubuntu 4.0K Jun 23 12:35 __pycache__
drwxrwxr-x 5 ubuntu ubuntu 4.0K Jun 23 12:34 build
-rw-rw-r-- 1 ubuntu ubuntu 1.1K Jun 23 12:33 convert_llama_weights_to_hf.py
drwxrwxr-x 2 ubuntu ubuntu 4.0K Jun 23 12:34 dist
-rw-rw-r-- 1 ubuntu ubuntu 7.6K Jun 23 12:33 gptq.py
-rw-rw-r-- 1 ubuntu ubuntu  20K Jun 23 12:33 llama.py
-rw-rw-r-- 1 ubuntu ubuntu  16K Jun 23 12:33 neox.py
-rw-rw-r-- 1 ubuntu ubuntu  18K Jun 23 12:33 opt.py
drwxrwxr-x 3 ubuntu ubuntu 4.0K Jun 23 12:35 quant
-rw-rw-r-- 1 ubuntu ubuntu 1.3K Jun 23 12:33 quant_cuda.cpp
drwxrwxr-x 2 ubuntu ubuntu 4.0K Jun 23 12:33 quant_cuda.egg-info
-rw-rw-r-- 1 ubuntu ubuntu 7.9K Jun 23 12:33 quant_cuda_kernel.cu
-rw-rw-r-- 1 ubuntu ubuntu  170 Jun 23 12:33 requirements.txt
-rw-rw-r-- 1 ubuntu ubuntu  333 Jun 23 12:33 setup_cuda.py
drwxrwxr-x 3 ubuntu ubuntu 4.0K Jun 23 12:35 utils
 [fastchat] ubuntu@h100:/workspace/git/FastChat/repositories/GPTQ-for-LLaMa (fastest-inference-4bit ✘)✭ ᐅ pip3 uninstall quant-cuda
Found existing installation: quant-cuda 0.0.0
Uninstalling quant-cuda-0.0.0:
  Would remove:
    /workspace/venv/fastchat/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg
Proceed (Y/n)? Y
  Successfully uninstalled quant-cuda-0.0.0
 [fastchat] ubuntu@h100:/workspace/git/FastChat/repositories/GPTQ-for-LLaMa (fastest-inference-4bit ✘)✭ ᐅ git switch old-cuda
Branch 'old-cuda' set up to track remote branch 'old-cuda' from 'origin'.
Switched to a new branch 'old-cuda'
 [fastchat] ubuntu@h100:/workspace/git/FastChat/repositories/GPTQ-for-LLaMa (old-cuda ✘)✭ ᐅ python3 setup_cuda.py install
running install
/workspace/venv/fastchat/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/workspace/venv/fastchat/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
building 'quant_cuda' extension
Emitting ninja build file /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/quant_cuda.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/TH -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/workspace/venv/fastchat/include -I/usr/include/python3.10 -c -c /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/quant_cuda.cpp -o /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/quant_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[2/2] /usr/local/cuda/bin/nvcc  -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/TH -I/workspace/venv/fastchat/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/workspace/venv/fastchat/include -I/usr/include/python3.10 -c -c /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/quant_cuda_kernel.cu -o /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1013"' -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -std=c++17

... log trimmed here ...

x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/quant_cuda.o /workspace/git/FastChat/repositories/GPTQ-for-LLaMa/build/temp.linux-x86_64-cpython-310/quant_cuda_kernel.o -L/workspace/venv/fastchat/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/x86_64-linux-gnu -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/quant_cuda.cpython-310-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-cpython-310/quant_cuda.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for quant_cuda.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/quant_cuda.py to quant_cuda.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying quant_cuda.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.quant_cuda.cpython-310: module references __file__
creating 'dist/quant_cuda-0.0.0-py3.10-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing quant_cuda-0.0.0-py3.10-linux-x86_64.egg
creating /workspace/venv/fastchat/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg
Extracting quant_cuda-0.0.0-py3.10-linux-x86_64.egg to /workspace/venv/fastchat/lib/python3.10/site-packages
Adding quant-cuda 0.0.0 to easy-install.pth file

Installed /workspace/venv/fastchat/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg
Processing dependencies for quant-cuda==0.0.0
Finished processing dependencies for quant-cuda==0.0.0
 [fastchat] ubuntu@h100:/workspace/git/FastChat/repositories/GPTQ-for-LLaMa (old-cuda ✘)✭ ᐅ cd ../..
 [fastchat] ubuntu@h100:/workspace/git/FastChat (main ✘)✭ ᐅ python3 -m fastchat.serve.cli \
    --model-path /workspace/process/vicuna-33b/gptq \
    --gptq-wbits 4
Loading GPTQ quantized model...
Loading model ...
Done.
USER: write a story about llamas
ASSISTANT: Once upon a time, in the picturesque mountains of Peru, there was a magical land called Llama Valley. This valley was home to a diverse and vibrant community of llamas, each with their unique personalities and abilities. The llamas lived in harmony, taking turns watching over the valley and protecting it from any dangers that may emerge.

The story revolves around a young llama named Llama League. Llama League was a curious and adventurous soul, always eager to learn about the world and the powers that existed within it. One day, Llama League was exploring the mountainside when they stumbled upon a mysterious, ancient temple hidden deep within the mountains. This temple was known as the Temple of the Stars, and it was said to contain the secret to unlocking the true power of the llama community.

Inside the temple, Llama League discovered a vast chamber with a massive mosaic on the floor, depicting a constellation of llamas surrounded by a celestial circle. In the center of the mosaic stood a pedestal with a peculiar artifact: the Star Orb. The Star Orb seemed to pulse with an ethereal energy, and Llama League could feel its power resonating within them.

As Llama League touched the Star Orb, they were suddenly overcome by a surge of ancient knowledge and mystical abilities. The young llama discovered they could harness the power of the stars, using it to protect their valley and help their fellow llamas. With this newfound power, Llama League became the leader of the Llama League, a group of talented llamas who swore to protect their home and uphold the principles of unity and harmony.

Not all was well in the Llama Valley, however. An ancient evil, known as the Shadow Beast, had been locked away for centuries, but it was now stirring once again. The Shadow Beast sought to destroy the valley and its inhabitants, hoping to plunge the world into darkness.

Armed with their newfound abilities, Llama League and their companions set out on an epic journey to gather their fellow llamas and prepare them for the incoming threat. They traversed the valley, meeting with the wisest elders and the strongest warriors, each of
USER:

Thanks, I got it to work, but it's extremely slow on my side. I can see that the model is loaded into GPU VRAM, but it looks like it is inferencing on the CPU.
I also got this error message when loading:
2023-06-23 21:18:57.809967: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
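(One quick way to check whether generation is actually hitting the GPU, as a generic diagnostic and assuming a CUDA build of PyTorch, is to watch utilization in a second terminal while the model generates, and to verify that torch sees the GPU at all:

    watch -n 1 nvidia-smi
    python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

Near-zero GPU utilization with one pegged CPU core during generation would suggest the CUDA kernel isn't being used.)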

Large Model Systems Organization

@ruradium @TheBloke Thanks for the discussion here. Please help us update the GPTQ support in FastChat. It is a community-contributed feature that we do not have the bandwidth to maintain.

@lmzheng OK, I understand. Sorry.

I'll have a look to see if I can PR an AutoGPTQ implementation.
