Missing Tensors in Q5_K_S + Q5_K_M
Hey, there seem to be missing tensors. llama.cpp reports the following error:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 724, got 723
Version: Q5_K_S.gguf
Downloaded size: 48657451584 bytes
@Digital-At-Work-Christopher you may need to re-download the file; it was updated a couple of days ago with the newest llama.cpp fixes.
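If it helps, a re-download along these lines should pull the current revision with huggingface-cli (the include pattern below is just an example for the Q5_K_S file, adjust it for the quant you want):

huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF --include "*Q5_K_S*" --local-dir ./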
I'm using the latest version of llama.cpp and the uploaded *.gguf file from this repo.
I downloaded the GGUF yesterday and again today.
I also tested Q5_K_M now; same error.
Something is off then, because if you inspect the model itself on HF here: https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/tree/main?show_file_info=Meta-Llama-3.1-70B-Instruct-Q5_K_S.gguf
You can see it has the proper 724 tensors. I'll re-download and double-check, but I'm not sure how it could have gone wrong.
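If you want to check the count locally as well, a rough sketch using the gguf Python package (I'm going from memory on the --no-tensors flag, so verify against gguf-dump --help on your install):

pip install -U gguf
gguf-dump --no-tensors ../models/Meta-Llama-3.1-70B-Instruct-Q5_K_S.gguf

The metadata it prints should show 724 tensors for an intact file (and split.tensors.count for the split quants).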
Can you give me the exact command you're running that produces the error?
I just redownloaded it and ran it in llama.cpp without any issue :(
My gosh... I'm sorry, I checked everything one more time and noticed that I had apparently forgotten to update my docker compose setup after rebuilding. It works now!
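In case anyone trips over the same thing, the fix on my side amounted to rebuilding the image and recreating the container, roughly like this (the service name llama-cpp is just a placeholder for whatever your compose file uses):

docker compose build --no-cache llama-cpp
docker compose up -d --force-recreate llama-cpp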
Ah okay, good good. I'm glad it was just user error, those are the easiest for me to correct ;D
I'm having the same problem with the Q8_0 model and llama.cpp.
Same with Q8_0.
Can you share the exact commands you're running?
sha256sum ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00002-of-00002.gguf
e2550873a3b7189ff35569411f36eb04a5c69d6ecb459874e6943e12b0f54ec9 ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf
4fce1532438024d8e1537879156258c3a935783c8dbe75a109c87cfd4dfc38a6 ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00002-of-00002.gguf
./llama-cli -m ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf -n -1 -e --mirostat 2 --temp 0.3 --repeat_penalty 1.1 --n-gpu-layers 10 --conversation -b 32000 --flash-attn --multiline-input -p "null" -c 32000
Log start
main: build = 3466 (01aec4a6)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed = 1724697065
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from ../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 80
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 7
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-70B-Instru...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - kv 33: split.no u16 = 0
llama_model_loader: - kv 34: split.count u16 = 2
llama_model_loader: - kv 35: split.tensors.count i32 = 724
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 562 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 69.82 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.68 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 724, got 723
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../models/Meta-Llama-3.1-70B-Instruct-Q8_0-00001-of-00002.gguf'
main: error: unable to load model
Oh, well yours is a bit easier, @james73686: you're running a llama.cpp build that's older than the change that made this work. Your build is from July 25, and the fix for this was added on July 27.
Your build: https://github.com/ggerganov/llama.cpp/commit/01aec4a6
The fixed build: https://github.com/ggerganov/llama.cpp/commit/b5e95468b1676e1e5c9d80d1eeeb26f542a38f42
So please update your repo, rebuild, and try again.
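Roughly, the update and rebuild looks like this for a Makefile-based CUDA build (swap in your usual cmake invocation and flags if that's how you built it):

git pull origin master
make clean
make GGML_CUDA=1 -j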
Thanks. That's what the problem was.