error loading model: cannot find tokenizer merges in model file

#2 by tranhoangnguyen03 - opened

error loading model: cannot find tokenizer merges in model file

llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/__main__.py", line 96, in <module>
    app = create_app(settings=settings)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/app.py", line 343, in create_app
    llama = llama_cpp.Llama(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 365, in __init__
    assert self.model is not None
AssertionError
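
For context, this traceback comes from running the llama-cpp-python server (`python3 -m llama_cpp.server`), which asserts that the underlying model handle is non-NULL after loading. The same failure can be reproduced without the server wrapper; a minimal sketch using llama-cpp-python directly (the model path is illustrative):

```python
from llama_cpp import Llama

# With the originally uploaded (broken) GGUF, llama.cpp cannot find the
# tokenizer merges, returns a NULL model handle, and this constructor
# fails with the AssertionError shown above.
llm = Llama(model_path="./causallm_7b.Q3_K_M.gguf")
out = llm("Introduce yourself.", max_tokens=64)
print(out["choices"][0]["text"])
```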

Apologies, the originally uploaded GGUFs had an error. Please try re-downloading; the newly uploaded GGUFs are confirmed to work with the latest llama.cpp.
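
If the broken file is still cached locally, a plain re-download may silently reuse it. One way to force a fresh copy is huggingface_hub's hf_hub_download; a sketch, with the repo and file names taken from the paths quoted later in this thread:

```python
from huggingface_hub import hf_hub_download

# force_download=True bypasses any previously cached (broken) copy.
path = hf_hub_download(
    repo_id="TheBloke/CausalLM-7B-GGUF",
    filename="causallm_7b.Q3_K_M.gguf",
    force_download=True,
)
print(path)
```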

I tried this model using the latest llama.cpp, but I still have a problem:

C:\llama.cpp>main.exe -m D:\llmmodels\TheBloke\CausalLM-7B-GGUF\causallm_7b.Q3_K_M.gguf -ngl 18 -p "introduce yourself."
Log start
main: build = 1419 (e393259)
main: built with MSVC 19.35.32217.1 for x64
main: seed = 1698132297
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA T400 4GB, compute capability 7.5
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from D:\llmmodels\TheBloke\CausalLM-7B-GGUF\causallm_7b.Q3_K_M.gguf (version unknown)
llama_model_loader: - tensor 0: token_embd.weight q3_K [ 4096, 151936, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q3_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q3_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q5_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q3_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q3_K [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down.weight q5_K [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
...

llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.merges arr
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q3_K: 129 tensors
llama_model_loader: - type q4_K: 92 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 293/151936 vs 85/151936 ).
llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 109170
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q3_K - Medium
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 3.64 GiB (4.05 BPW)
llm_load_print_meta: general.name = causallm_7b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 2057.64 MB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/35 layers to GPU
llm_load_tensors: VRAM used: 1672.59 MB
..................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 310.88 MB
llama_new_context_with_model: VRAM scratch buffer: 304.75 MB
llama_new_context_with_model: total VRAM used: 1977.34 MB (model: 1672.59 MB, context: 304.75 MB)

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

introduce yourself. Where
CUDA error 9 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:6862: invalid configuration argument
current device: 0
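
CUDA error 9 ("invalid configuration argument") is a GPU kernel launch-configuration failure, so this is a different problem from the original tokenizer error: the model now loads, and generation fails in the CUDA offload path. One way to check whether the re-uploaded GGUF itself is sound would be to rerun fully on the CPU by disabling layer offload (a diagnostic sketch, not a confirmed fix):

```
C:\llama.cpp>main.exe -m D:\llmmodels\TheBloke\CausalLM-7B-GGUF\causallm_7b.Q3_K_M.gguf -ngl 0 -p "introduce yourself."
```

If CPU-only generation succeeds, the remaining issue lies in the CUDA path on this 4 GB T400 rather than in the model file.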
