Error loading the model - missing tensor weight?

#4 opened by mancub

Latest llama.cpp pull and latest download of the model result in an error:

llama.cpp: loading model from models/thebloke_vicunlocked-30b-lora.ggml.v3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.12 MB
error loading model: llama.cpp: tensor 'layers.55.attention_norm.weight' is missing from model
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/thebloke_vicunlocked-30b-lora.ggml.v3.q8_0.bin'
main: error: unable to load model

Sorry, it looks like the q8_0 file didn't upload properly. I will fix that now. Will likely take an hour or so.
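For anyone else who hits this: a quick way to check whether a local copy is complete is to compare its SHA-256 with the hash shown on the file's page under the Git LFS details, e.g.:

sha256sum models/thebloke_vicunlocked-30b-lora.ggml.v3.q8_0.bin

If the hashes don't match, the download (or the upload) was truncated.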

New q8_0 file is now uploaded, apologies for the delay.

No need to apologize. We work on your time here - whenever you are ready.

Downloading the new file now but still got 1 hr to go.

It's working now, but slower than the other models. I was able to squeeze 41 of the q8_0 model's 60 layers into my 3090. :)
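For reference, the GPU offload is controlled by llama.cpp's -ngl / --n-gpu-layers flag, so a command along these lines should reproduce that 41-layer split; the context size, token count and prompt here are only placeholders:

./main -m models/thebloke_vicunlocked-30b-lora.ggml.v3.q8_0.bin -ngl 41 -c 2048 -n 256 -p "Write a story about llamas"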

Weirdly enough I got a "USER: continue" from it mid-sentence while it was writing the response. Then it backed off, repeated the last sentence where it stopped, and carried on.

llama_print_timings: load time = 18930.83 ms
llama_print_timings: sample time = 425.43 ms / 687 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 2841.70 ms / 24 tokens ( 118.40 ms per token)
llama_print_timings: eval time = 277954.03 ms / 686 runs ( 405.18 ms per token)
llama_print_timings: total time = 297548.04 ms

Going to try a more quantized version next, although based on my testing of the Manticore model, it appears that q8_0 is faster than q5_1. So maybe I'll try q5_0 instead, as most of the layers might fit into the 3090.

This model also does not want to stop rambling. Even in instruct mode, it completes the response and then gives itself another instruction, similar to the original one, and just keeps going.

I guess it's like Grandpa Simpson, minus the narcolepsy, LOL.

Yeah, a lot of people are reporting the auto-continue/rambling issue with Manticore. The model's creator is looking into it, but isn't yet sure why it's happening.

I know from my own experience that it doesn't happen with the Instruction / Response template, e.g. when I tested 10 or so single prompts like these, I never got the issue:

-p "###Instruction: Write an essay comparing France and Germany\n### Response:
-p "###Instruction: what is pythagorus theorem? Give some examples\n### Response:"
-p "###Instruction: explain in detail the differences between C, C++ and Objective C\n### Response:"

I'm talking about VicUnlocked though, not Manticore. I asked it to write me a story about llamas (the generic instruction) and it went on to write several more before I Ctrl-C'd out of it.

Oh right yeah! Getting confused between all the models.

I never really tested this model, so I can't say if that's usual or not. According to the original model card, it's a Vicuna that's been converted to "more like Alpaca style", using "some of Vicuna 1.1".

Vicuna 1.0 was very strict about its prompt template. You had to use this format, or else it wouldn't reply at all or would run on and on:

### Human: Write a story about llamas
### Assistant:

Maybe because this is a conversion from Vicuna to Alpaca, some of that issue hasn't been fixed? Are you using the Alpaca template at the moment? Have you tried the Vicuna template like the one above?
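If it helps, a minimal llama.cpp invocation using that template might look something like this; the -ngl value is just carried over from your earlier run, and -i -r "### Human:" hands control back to you whenever the model tries to start the next Human turn itself, which should also curb the rambling:

./main -m models/thebloke_vicunlocked-30b-lora.ggml.v3.q8_0.bin -ngl 41 -c 2048 -i -r "### Human:" -p "### Human: Write a story about llamas\n### Assistant:"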

That did it! Should've paid more attention to what was derived from where. :)

There are so many models now, and with all these small nuances about their use, it's hard to keep track of everything.
