Something wrong with this model, I think.

#1 opened by dranger003

All your models are working fine, but this one (I tried openchat_v3.2.ggmlv3.q8_0.bin and openchat_v3.2.ggmlv3.q4_K_M.bin) gives me the error below. This is with the latest llama.cpp compiled from source:
GGML_ASSERT: C:\llama.cpp\ggml-cuda.cu:4749: i01_high == rows_per_iter || g_device_count > 1

Edit: it seems to be something to do with CUDA, as it works fine running on CPU only. This may be helpful too (the proposed fix works for me, although I inserted it at line 4744 instead of 4381 - probably not the same commit):
https://github.com/ggerganov/llama.cpp/pull/2160#issuecomment-1657203763

Edit #2: I tried converting from the source model and I get the same error. Could it be the 32002 vocab size?
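
To illustrate the hunch, here's a minimal sketch of the kind of row-splitting that could break on a vocab size of 32002, assuming the CUDA code divides matrix rows into fixed-size chunks (the chunk size of 32 below is purely hypothetical, not taken from the actual ggml-cuda.cu code):

```python
# Hypothetical sketch: if rows are split into fixed-size chunks, an n_vocab
# that isn't a multiple of the chunk size leaves a ragged final slice, which
# an equality check like "i01_high == rows_per_iter" would then reject.
n_vocab = 32002
chunk = 32  # hypothetical row granularity, not llama.cpp's actual value

print(n_vocab % chunk)  # 2 -> the last slice comes up short

# One common remedy is to round the row count up to the next multiple:
padded = (n_vocab + chunk - 1) // chunk * chunk
print(padded)  # 32032
```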

Hmm, yes, the non-standard vocab size has caused problems with CUDA in GGML in the past. But it's working fine for me with all layers offloaded:

 [pytorch2] tomj@d442126f7dde:/workspace/git/llama.cpp (master ✔) ᐅ ./main -m /workspace/process/openchat_v3.2/ggml/openchat_v3.2.ggmlv3.q8_0.bin -ngl 100 -t 1 -p "GPT4 User: write a story about llamas<|end_of_turn|>GPT4 Assistant:"
main: build = 933 (0728c5a)
main: seed  = 1690831897
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA L40, compute capability 8.9
  Device 1: NVIDIA L40, compute capability 8.9
  Device 2: NVIDIA L40, compute capability 8.9
  Device 3: NVIDIA L40, compute capability 8.9
llama.cpp: loading model from /workspace/process/openchat_v3.2/ggml/openchat_v3.2.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32002
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 6912
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA L40) as main device
llama_model_load_internal: mem required  =  532.14 MB (+  400.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 360 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 13784 MB
llama_new_context_with_model: kv self size  =  400.00 MB

system_info: n_threads = 1 / 256 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 GPT4 User: write a story about llamas<|end_of_turn|>GPT4 Assistant: In the high plains of Peru, nestled between the Andes Mountains and the sprawling grasslands, lay a mysterious valley. For centuries, this secluded spot had remained hidden from the outside world, its secrets kept by the ancient Incas who once inhabited these lands. The valley was home to a small herd of llamas, gentle creatures with long necks and comical faces, whose wool provided warmth for the people of the Andes. But these llamas were not just any ordinary animals; they held a special power that had been passed down from generation to generation.

One day, two travelers stumbled upon the valley while searching for a lost Incan city rumored to contain untold riches and treasures. As they walked through the lush greenery, they heard the sound of bells ringing in the distance. Curious, they followed the melodic tune until they came upon a herd of llamas grazing peacefully. But something about these llamas was different—they were glowing with an otherworldly light, and the travelers felt a sense of calmness wash over them.

One of the travelers, a man named Jackson, approached the llamas cautiously, unsure of what to expect. To his surprise, the llamas didn't flee but instead allowed him to pet their soft fur. As he did so, Jackson felt a strange tingling sensation course through his body. Suddenly, he found himself understanding the llama's language as if it were his own. The llamas were communicating with him, revealing that they held the key to the lost Incan city.

Jackson and his companion, a woman named Emily, spent the next few days learning from the wise llamas. They discovered that the animals possessed ancient knowledge and wisdom, passed down through generations, which allowed them to connect with the spiritual realm. The llamas explained that they were guardians of the valley, protecting its secrets until the right people came along.

With the help of the llamas, Jackson and Emily finally found the lost city, hidden deep within a labyrinth of caves. Inside, they discovered a treasure trove of artifacts, each one holding a piece of the ancient Incan culture's history and wisdom. The llamas had been right - there was a connection between them and the Inca people that went beyond mere coincidence.

As Jackson and Emily returned to their homes, they couldn't shake the sense that something profound had happened. They knew that they had been changed by their encounter with the llamas, and that this experience would stay with them for the rest of their lives. And so, they continued to study and share the wisdom of the Incan civilization, passing it on to future generations.

In a world where technology often overshadows ancient traditions, the llamas remind us that there is still much to learn from our past. Their presence in the valley serves as a reminder to seek out knowledge and embrace the wisdom of those who came before us. And who knows? Maybe one day, we too might find ourselves standing face-to-face with a wise llama, ready to share their ancient secrets.

## Contributors

* [@juliangmolina](https://twitter.com/juliangmolina) - Story and Art
* [@diegoberruecos](https://twitter.com/diegoberruecos) - Illustrations
* [@elisaferreira](https://twitter.com/elisaferreira) - Layouts
* [@julianaponzo](https://twitter.com/julianaponzo) - Proofreading and Production
 [end of text]

It also worked fine when I only offloaded some layers.

So I'm not quite sure why you're getting these problems, but I am confident the GGMLs are OK. Maybe report it on the llama.cpp GitHub?

Same issue. And yes, I am on Windows :/

The latest commit of llama.cpp (4f6b60c) should resolve the issue.

@dranger003 could you let me know how well this model is working for you with GGML, in terms of prompt templating and getting the model to stop generating at the right time?

Could you let me know the command line parameters/options you use for this model?

@TheBloke Sure! This appears to work well for me:

bin\Release\main.exe -t 1 -ngl 63 -c 4096 -e -p "GPT4 User: Tell me all about how awesome TheBloke is.<|end_of_turn|>GPT4 Assistant:" -r "<|end_of_turn|>" -m models\openchat_v3.2.ggmlv3.q8_0.bin
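
The important bits are the prompt template (GPT4 User: ...<|end_of_turn|>GPT4 Assistant:) and -r "<|end_of_turn|>", which tells main to stop once the model emits the end-of-turn token. If you drive it from Python instead, a rough llama-cpp-python equivalent might look like this (an untested sketch; the model path and max_tokens are placeholders):

```python
from llama_cpp import Llama

# Load the GGML model with layers offloaded to the GPU and a 4096 context,
# mirroring the -ngl 63 -c 4096 flags above. The path is a placeholder.
llm = Llama(
    model_path="models/openchat_v3.2.ggmlv3.q8_0.bin",
    n_gpu_layers=63,
    n_ctx=4096,
)

prompt = "GPT4 User: Tell me all about how awesome TheBloke is.<|end_of_turn|>GPT4 Assistant:"

# stop=["<|end_of_turn|>"] plays the role of -r "<|end_of_turn|>":
# generation halts as soon as the model emits the end-of-turn marker,
# instead of rambling into a new "GPT4 User:" turn.
out = llm(prompt, stop=["<|end_of_turn|>"], max_tokens=512)
print(out["choices"][0]["text"])
```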

Thanks very much!
