Problems....

#3
by Nerdsking - opened

llama.cpp had barely announced MTP support for Step models, and you already made this. Keep up the good work. Congratulations.

Nerdsking changed discussion title from Nice and fast Job to Problems....

Yep... But I got this while trying to load the Q5_K_M...
0.00.803.436 E llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 805, got 790
0.00.803.444 E llama_model_load_from_file_impl: failed to load model
0.00.803.497 E srv load_model: [spec] failed to measure MTP context memory: failed to load model

Just to make sure, did you pull and recompile? It loads and runs for me on master, which is what I used for the PPL / KLD testing.

I took my latest version directly from the releases page, so it comes precompiled (Windows 11, CUDA 13.3). I am using llama-b9484. https://github.com/ggml-org/llama.cpp/releases. MTP for Stepfun has been available since release b9480.

Looks like I had a newer version?

$ ./build/bin/llama-server --version
version: 9481 (bfb4308b0)
built with GNU 15.2.1 for Linux x86_64

I just loaded it up to double check:

../llama.cpp/build/bin/llama-server \
    --threads 54 --batch-size 4096 --ubatch-size 4096 --fit-target 4096,4096,4096,4096,4096,4096,4096,4096 --direct-io \
    --ctx-size 262144 --flash-attn on --port 10000 --host 0.0.0.0 --log-prefix --log-timestamps \
    --model /mnt/srv/snowdrift/gguf/Step-3.7-Flash-GGUF/aes_sedai/Step-3.7-Flash-Q5_K_M.gguf --alias "Step-3.7-Flash Q8_0" \
    --parallel 1 --spec-type draft-mtp --spec-draft-n-max 2

....

1.11.742.612 I srv    load_model: creating MTP draft context against the target model '/mnt/srv/snowdrift/gguf/Step-3.7-Flash-GGUF/aes_sedai/Step-3.7-Flash-Q5_K_M.gguf'
1.11.870.515 I srv    load_model: initializing slots, n_slots = 1
1.11.898.909 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
1.11.898.919 I common_speculative_impl_draft_mtp: - n_max=2, n_min=0, p_min=0.00, n_embd=4096, backend_sampling=1
1.11.898.921 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]

and the server came up without issues. Did you download the new quants? They were just updated a little while ago.

My version is newer than yours, but I will download the exact version you are using. It could be a problem introduced by the latest release. I will also download your model again, since I deleted it. And yes, I downloaded it as soon as you updated your page to announce the MTP version.

There was a short window where I had messed the upload up and had to fix it, so I had to re-upload the splits. Maybe you ended up with one good split and one bad split or something and that caused the issue. The split files are all correct now.
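
For anyone who wants to rule out a mixed set of old and new splits before re-downloading everything, here is a minimal sketch that hashes each local split and compares it with a SHA-256 value you copy by hand from the file listing on the model page. The function names are hypothetical and not part of llama.cpp, and the expected digests have to be supplied by you.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_splits(expected):
    """Given {path: expected_hex_digest}, return the paths that are missing or differ.

    `expected` is an assumption of this sketch: you fill it in yourself with the
    digests published for each split file.
    """
    bad = []
    for path, want in expected.items():
        p = Path(path)
        if not p.is_file() or sha256_of_file(p) != want.lower():
            bad.append(str(path))
    return bad
```

Call `find_mismatched_splits` with a dict that maps each local split path to its published digest. Any path it returns is a split worth downloading again; an empty list means every file matches the digest you supplied.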

Ah, that must be it. I downloaded it again and now it is working just fine. No problem at all. Thanks for your nice work.

Glad it's sorted! 🤗
