ggml-cuda.cu:5974: an illegal memory access was encountered
I'm running llama.cpp (latest code) with a GGUF model, but it crashes with ggml-cuda.cu:5974: an illegal memory access was encountered.

The error message:
(base) PS C:\Users\x\code\llama.cpp_new\llama.cpp> .\build\bin\Release\main.exe -m ..\..\llama-cpp-python\models\llama-2-13b-chat.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 20
Log start
main: build = 1198 (ebc9608)
main: seed = 1694166414
ggml_init_cublas: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6
Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ..\..\llama-cpp-python\models\llama-2-13b-chat.Q8_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q8_0 [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ]
......
llama_model_loader: - tensor 360: blk.39.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 2048
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q8_0
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3080) as main device
llm_load_tensors: mem required = 6761.07 MB (+ 1600.00 MB per state)
llm_load_tensors: offloading 20 repeating layers to GPU
llm_load_tensors: offloaded 20/43 layers to GPU
llm_load_tensors: VRAM used: 6429 MB
....................................................................................................
llama_new_context_with_model: kv self size = 1600.00 MB
llama_new_context_with_model: compute buffer total size = 96.47 MB
llama_new_context_with_model: VRAM scratch buffer: 95.00 MB
CUDA error 700 at C:\Users\xxx\Code\llama.cpp_new\llama.cpp\ggml-cuda.cu:5974: an illegal memory access was encountered
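For context, CUDA error 700 is cudaErrorIllegalAddress: a kernel touched device memory it doesn't own, and the failure surfaces at the next synchronizing runtime call. llama.cpp reports it through an error-check macro wrapped around CUDA calls; the standalone sketch below uses my own illustrative macro (not the exact code at ggml-cuda.cu:5974) to show how such a check produces this exact message:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative check in the spirit of llama.cpp's CUDA_CHECK; the real
// macro in ggml-cuda.cu may differ in detail.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                \
                    (int) err_, __FILE__, __LINE__,                        \
                    cudaGetErrorString(err_));                             \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

__global__ void touch(float * dst) {
    dst[threadIdx.x] = 1.0f; // write through an invalid device pointer
}

int main() {
    touch<<<1, 32>>>(nullptr);           // kernel dereferences a null pointer
    // The illegal access is detected asynchronously and shows up here as
    // error 700: "an illegal memory access was encountered".
    CUDA_CHECK(cudaDeviceSynchronize());
    return 0;
}

In the run above the faulting access happens inside one of ggml's kernels; with 20 of 43 layers offloaded across 3 GPUs, a bad cross-device access is a plausible trigger, which is why the single-GPU test below is worth trying.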
Try setting the environment variable CUDA_VISIBLE_DEVICES=0 (or similar) so that only a single GPU is visible to llama.cpp; if you load the model through a Python pipeline instead, also pin it to one device, e.g. device='cuda:0'.
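Assuming the same PowerShell session as above, that test would look like this (the command is unchanged from the original run; only device 0 is made visible):

(base) PS C:\Users\x\code\llama.cpp_new\llama.cpp> $env:CUDA_VISIBLE_DEVICES="0"
(base) PS C:\Users\x\code\llama.cpp_new\llama.cpp> .\build\bin\Release\main.exe -m ..\..\llama-cpp-python\models\llama-2-13b-chat.Q8_0.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 20

If the crash disappears with one GPU, that points at the multi-GPU split path rather than the model file itself.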