CUDA error when loading Llama 7B Chat

#24 opened by rachelshalom

Hi,
I am using the latest LangChain to load llama.cpp. I installed llama-cpp-python with:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
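To double-check that the wheel was actually built with cuBLAS, I print the system info string (a quick sketch; it assumes the low-level llama_cpp bindings expose llama_print_system_info, which current releases do):

python -c "import llama_cpp; print(llama_cpp.llama_print_system_info())"

The output should contain BLAS = 1, which matches the load log further down.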
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

nvidia-smi:
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 PCIe On | 00000000:3F:00.0 Off | 0 |
| N/A 29C P0 47W / 350W| 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 PCIe On | 00000000:56:00.0 Off | 0 |
| N/A 30C P0 51W / 350W| 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 PCIe On | 00000000:C3:00.0 Off | 0 |
| N/A 31C P0 49W / 350W| 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 PCIe On | 00000000:DA:00.0 Off | 0 |
| N/A 32C P0 51W / 350W| 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

I managed to load the model with cuBLAS (BLAS = 1):

llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA H100 PCIe) as main device
llm_load_tensors: mem required = 5114.10 MB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/35 layers to GPU
llm_load_tensors: VRAM used: 158.35 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 3000
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1500.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 223.99 MB
llama_new_context_with_model: VRAM scratch buffer: 217.36 MB
llama_new_context_with_model: total VRAM used: 375.71 MB (model: 158.35 MB, context: 217.36 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
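
For reference, I load the model through LangChain roughly like this (the model path and n_batch are placeholders; n_gpu_layers=1 and n_ctx=3000 match the log above):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=1,   # matches "offloaded 1/35 layers" above
    n_ctx=3000,       # matches n_ctx in the log
    n_batch=512,      # placeholder
    verbose=True,
)
print(llm("Hello"))   # the error below appears once a prompt is processed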

But when the model is processing a prompt, I get a CUDA error:

CUDA error 222 at /tmp/pip-install-uq8lpx95/llama-cpp-python_c2bd3bc9a27b49f3805443a95df1ea3d/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain.
current device: 0
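
If I read the error right, code 222 is cudaErrorUnsupportedPtxVersion, i.e. the PTX was generated with a toolkit newer than what the driver supports, which would fit here since nvcc reports release 12.2 while nvidia-smi reports CUDA Version 12.1. One thing I am considering is rebuilding against a toolkit that matches the driver, roughly like this (a sketch; it assumes a CUDA 12.1 toolkit is installed at /usr/local/cuda-12.1):

CUDACXX=/usr/local/cuda-12.1/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir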

Any advice on how to solve this?

Hey, this is still a problem. I followed the steps shown in https://michaelriedl.com/2023/09/10/llama2-install-gpu.html and it did not solve the issue.
