'output_norm.weight' not found

#36
by harryballantyne

Downloaded the Q5_K_M model but can't seem to get it running with llama-cpp-python. Anyone got any idea how to fix this? The error I'm receiving is as follows:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 3 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
llama_model_loader: loaded meta data with 29 key-value pairs and 150 tensors from /scicore/home/meinlsch/ballan0000/LLMs/gguf/mixtral_8x22b/Mixtral-8x22B-Instruct-v0.1.Q5_K_M-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models--mistralai--Mixtral-8x22B-Inst...
llama_model_loader: - kv 2: llama.block_count u32 = 56
llama_model_loader: - kv 3: llama.context_length u32 = 65536
llama_model_loader: - kv 4: llama.embedding_length u32 = 6144
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 16384
llama_model_loader: - kv 6: llama.attention.head_count u32 = 48
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.expert_count u32 = 8
llama_model_loader: - kv 11: llama.expert_used_count u32 = 2
llama_model_loader: - kv 12: general.file_type u32 = 17
llama_model_loader: - kv 13: llama.vocab_size u32 = 32768
llama_model_loader: - kv 14: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32768] = ["&lt;unk&gt;", "&lt;s&gt;", "&lt;/s&gt;", "[INST]", "[...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {{bos_token}}{% for message in messag...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 4
llama_model_loader: - kv 28: split.tensors.count i32 = 563
llama_model_loader: - type f32: 29 tensors
llama_model_loader: - type f16: 15 tensors
llama_model_loader: - type q8_0: 30 tensors
llama_model_loader: - type q5_K: 67 tensors
llama_model_loader: - type q6_K: 9 tensors
llm_load_vocab: mismatch in special tokens definition ( 1027/32768 vs 259/32768 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 65536
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 56
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 8
llm_load_print_meta: n_expert_used = 2
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 65536
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 37.76 B
llm_load_print_meta: model size = 25.14 GiB (5.72 BPW)
llm_load_print_meta: general.name = models--mistralai--Mixtral-8x22B-Instruct-v0.1
llm_load_print_meta: BOS token = 1 '&lt;s&gt;'
llm_load_print_meta: EOS token = 2 '&lt;/s&gt;'
llm_load_print_meta: UNK token = 0 '&lt;unk&gt;'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MiB
llama_model_load: error loading model: create_tensor: tensor 'output_norm.weight' not found
llama_load_model_from_file: failed to load model

Yeah, I've got the same problem. I merged the 4 sharded files of Q5_K_M into one, and then I get this error.

These are not just a simple file split; they are shards of a GGUF model. You don't need to merge them: https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF#load-sharded-model
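For reference, with a sufficiently recent llama-cpp-python (one built against a llama.cpp version that understands GGUF splits), you only point at the first shard and the remaining `-0000x-of-00004.gguf` files in the same directory are picked up automatically. A minimal sketch, reusing the path from the log above; the `n_gpu_layers` and `n_ctx` values are illustrative, not from the thread:

```python
from llama_cpp import Llama

# Load the first shard only; llama.cpp locates the sibling shards itself.
llm = Llama(
    model_path="/scicore/home/meinlsch/ballan0000/LLMs/gguf/mixtral_8x22b/"
               "Mixtral-8x22B-Instruct-v0.1.Q5_K_M-00001-of-00004.gguf",
    n_gpu_layers=-1,  # illustrative: offload as many layers as fit on the GPUs
    n_ctx=8192,       # illustrative context size
)

out = llm("[INST] Say hello. [/INST]", max_tokens=32)
print(out["choices"][0]["text"])
```

If an older llama-cpp-python build is pinned, it will read only the 150 tensors present in the first shard and fail exactly like the log above, so upgrading is the first thing to try.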

If you must merge them, you need to use the native GGUF merge function to do it (simply concatenating the files does not produce a valid GGUF). But you don't have to; llama.cpp can work with the splits as they are.
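If a single file really is wanted, here is a hedged sketch of driving the merge from Python. It assumes llama.cpp's gguf-split tool with its --merge mode; the binary name and working directory are assumptions (it may be installed as `gguf-split` or `llama-gguf-split` depending on the build):

```python
import subprocess

# Assumption: a llama.cpp build providing the gguf-split tool is on disk.
# --merge rebuilds one GGUF starting from the first shard.
subprocess.run(
    [
        "./llama-gguf-split", "--merge",
        "Mixtral-8x22B-Instruct-v0.1.Q5_K_M-00001-of-00004.gguf",
        "Mixtral-8x22B-Instruct-v0.1.Q5_K_M.gguf",
    ],
    check=True,
)
```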
