Receiving an assert error when trying to run prediction with the model.
Is anybody else experiencing a GGML_ASSERT error when trying to run this model with the CLBlast build of llama.cpp?
I am using the latest CLBlast build of llama.cpp (from 31 July 2023), but I hit the same issue with several older versions as well.
I am using the example prompt from the model card.
The model seems to load fine, but instead of a response I only get this one-liner: "GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:10463: ne02 == ne12"
Any suggestion on how to fix it would be appreciated.
Thanks
The output I am getting is as follows:
main.exe -t 16 -ngl 21 -gqa 8 -m C:\Users\robo\Downloads\upstage-llama-2-70b-instruct-v2.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### System:\nThis is a system prompt, please behave and help the user.\n\n### User:\nWrite a story about llamas\n\n### Assistant:"
main: warning: base model only supports context sizes no greater than 2048 tokens (4096 specified)
main: build = 930 (9d2382b)
main: seed = 1690818332
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 4090 Laptop GPU'
ggml_opencl: device FP16 support: false
llama.cpp: loading model from C:\Users\robo\Downloads\upstage-llama-2-70b-instruct-v2.ggmlv3.q4_K_M.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 7168
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 28985.77 MB (+ 1280.00 MB per state)
llama_model_load_internal: offloading 21 repeating layers to GPU
llama_model_load_internal: offloaded 21/81 layers to GPU
llama_model_load_internal: total VRAM used: 10478 MB
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 561.35 MB
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0
System:\nThis is a system prompt, please behave and help the user.\n\n### User:\nWrite a story about llamas\n\n### Assistant:GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml.c:10463: ne02 == ne12
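For anyone digging into the message itself, below is a minimal C sketch of the shape check that appears to be failing. This is only my reading of what the assert means, not a quote of the ggml source: in ggml a tensor has four dimensions ne[0..3], and for the attention matrix multiply of a grouped-query-attention model like this 70B (-gqa 8), the K/V tensors carry 8 heads in ne[2] while the query tensor carries 64, so any code path that insists on ne02 == ne12 (i.e. cannot broadcast the K/V heads across the query heads) will abort exactly like this. The struct, function, and example shapes are all hypothetical.

#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a ggml tensor: four dimensions ne[0..3]. */
struct toy_tensor { int64_t ne[4]; };

static void mul_mat_like_check(const struct toy_tensor *src0,   /* K or V  */
                               const struct toy_tensor *src1) { /* queries */
    const int64_t ne02 = src0->ne[2];  /* K/V heads: 8 for this 70B model  */
    const int64_t ne12 = src1->ne[2];  /* query heads: 64                  */
    /* A matmul path without GQA head broadcasting requires equal counts,
       so it aborts here for an 8-vs-64 grouped-query-attention model.    */
    assert(ne02 == ne12);
}

int main(void) {
    /* Hypothetical shapes: {head_dim, n_tokens, n_heads, batch}. */
    struct toy_tensor k = { { 128, 4096,  8, 1 } };
    struct toy_tensor q = { { 128, 4096, 64, 1 } };
    mul_mat_like_check(&k, &q);  /* aborts, mirroring the GGML_ASSERT */
    return 0;
}

If that reading is right, the model file itself is probably fine and the problem lives in the accelerated matmul path that the CLBlast build takes.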
I am also getting an error when trying to run the cuBLAS build, but mine says:
CUDA error 222 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:4816: the provided PTX was compiled with an unsupported toolchain.
OK, my error was related to out-of-date video drivers. I have fixed it.
Hi dillfrescott,
Thanks for the note about cuBLAS working for you.
I tried it, and cuBLAS is working for me as well.
The GGML conversion of the model therefore seems to be fine; the issue is in the CLBlast path of the llama.cpp implementation.
I am switching to the cuBLAS build and ditching the CLBlast mess; the rough steps I followed are sketched below.
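For anyone else making the same switch, these are roughly the commands I used to rebuild llama.cpp with cuBLAS instead of CLBlast (based on the llama.cpp README at the time; the option names may have changed in newer versions, so double-check the current README):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

(The CLBlast build uses -DLLAMA_CLBLAST=ON in the configure step instead.) After that I ran the same main.exe command as in my first post, keeping -ngl to offload layers to the GPU.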
Issue solved
Thanks
No problem! Glad you were able to sort the issue out!
I'm still seeing this assertion error with CLBlast.
same here :)