Unable to use GGUF file with llama.cpp
#11
by iambulb - opened
Error received when trying to use the GGUF file with llama.cpp:
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.block_count u32
llama_model_loader: - kv 3: llama.context_length u32
llama_model_loader: - kv 4: llama.embedding_length u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.attention.head_count u32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32
llama_model_loader: - kv 8: llama.rope.freq_base f32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: llama.vocab_size u32
llama_model_loader: - kv 12: llama.rope.dimension_count u32
llama_model_loader: - kv 13: tokenizer.ggml.model str
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.merges arr
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: tokenizer.chat_template str
llama_model_loader: - kv 21: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
error loading model: cannot find tokenizer scores in model file
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf'
load_binding_model: error: unable to load model
Loading the model failed: failed loading model
exit status 1
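
For anyone debugging the same thing: you can confirm the scores key is actually absent by listing the file's metadata with the gguf Python package from llama.cpp's gguf-py (pip install gguf). A minimal sketch, assuming the quantized file is in the current directory; the same check works on any converted output:

# sketch: list GGUF metadata keys and check for tokenizer.ggml.scores
from gguf import GGUFReader

path = "./Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf"
reader = GGUFReader(path)

# reader.fields maps each kv name (general.architecture, tokenizer.ggml.tokens, ...) to its field
for name in reader.fields:
    print(name)

if "tokenizer.ggml.scores" not in reader.fields:
    print("tokenizer.ggml.scores is missing -> older llama.cpp builds abort with the error above")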
I also tried converting a working copy of Llama-3-8B-Lexi-Uncensored to GGUF using the instructions in the llama.cpp repo:
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>
# install Python dependencies
python3 -m pip install -r requirements.txt
# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/
# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
The conversion to GGUF succeeds, but I still get the same error when trying to use the resulting file.
I believe the problem is the missing tokenizer scores.
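
For context on what the loader was looking for: the tokenizer block of a GGUF normally carries parallel arrays of tokens, token types and, for sentencepiece models, per-token scores; Llama 3 uses a BPE tokenizer, so there are no real scores to export. The toy sketch below (not a usable model, and the exact GGUFWriter call sequence may differ between gguf-py versions) just writes those keys, including a neutral tokenizer.ggml.scores array, to show the shape of the data the older loader expected:

# toy sketch: write only the tokenizer-related GGUF keys an older loader expects
from gguf import GGUFWriter

tokens = ["<|begin_of_text|>", "<|end_of_text|>", "hello", "world"]   # toy vocab

writer = GGUFWriter("tokenizer-demo.gguf", arch="llama")
writer.add_tokenizer_model("gpt2")                  # BPE-style tokenizer, as in Llama 3
writer.add_token_list(tokens)
writer.add_token_scores([0.0] * len(tokens))        # the array the error says is missing
writer.add_token_types([1] * len(tokens))           # 1 = NORMAL token type

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()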
This was fixed by ngxson's llama.cpp PR #9117 (https://github.com/ggerganov/llama.cpp/pull/9117#event-13991670919) on GitHub.
iambulb changed discussion status to closed