Yi-34B-200K

Getting empty output with latest llama-cpp-python (0.2.18)

#4 by Notel

Other GGUF models work fine, but this one gives empty output. I've tried both Q4_0 and Q4_K_M. Any idea what I'm doing wrong? I've tried several context lengths; here's my code:

from llama_cpp import Llama

llm2 = Llama(model_path=model_path, n_threads=conf.cg['threads'], verbose=True)
for output in llm2("USER: The world is \nASSISTANT:\n", max_tokens=500, stop=["</s>"], temperature=0.7, echo=True, stream=True):
    print(output["choices"][0]["text"], end="")  # print each streamed chunk as it arrives

Question:
USER: The world is \nASSISTANT:\n

Answer:
</s>

(sometimes it also answers with a single dot ".")

llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 64000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 200000
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_head = 56
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 20480
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 200000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 34.39 B
llm_load_print_meta: model size = 18.13 GiB (4.53 BPW)
llm_load_print_meta: general.name = nousresearch_nous-capybara-34b
llm_load_print_meta: BOS token = 144 '\n'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 315 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.20 MB
llm_load_tensors: mem required = 18563.49 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 120.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 140.56 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
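
One detail from that log worth flagging: the GGUF metadata declares the EOS token as '<|endoftext|>' (id 2), not the literal '</s>' string used in the stop list above. A quick sketch to confirm what the loaded model reports (assuming llm2 is the Llama instance from the snippet above):

# Inspect the EOS token the loaded GGUF declares; the log above reports 2 '<|endoftext|>'
eos_id = llm2.token_eos()                  # token id of the model's EOS token
print(eos_id, llm2.detokenize([eos_id]))   # detokenize maps token ids back to bytes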

Yeah, I see the same - it seems this model is very sensitive to the prompt template. The key seems to be not adding a space and/or newline after ASSISTANT:

Also, it looks like there shouldn't be a newline between USER and ASSISTANT. I've therefore corrected my prompt template to:

USER: {prompt} ASSISTANT:
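
(In Python terms, a sketch of building that template; prompt_template and the sample text are placeholders:)

prompt_template = "USER: {prompt} ASSISTANT:"  # note: no space or newline after "ASSISTANT:"
prompt = prompt_template.format(prompt="The world is")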

Test 1:

# space after ASSISTANT:
-p "USER:\nThe world is\nASSISTANT: "

Output = bad:
USER:\nThe world is\nASSISTANT: .</s> [end of text]

Test 2:

# No space after ASSISTANT:
-p "USER:\nThe world is\nASSISTANT:"

Output = OK:
USER:\nThe world is\nASSISTANT: The world is a vast and diverse place, full of different cultures, languages, and landscapes. It has a rich history that has shaped the lives of ... 

Test 3:

# Same again
-p "USER:\nThe world is\nASSISTANT:"

Output = bad:
USER:\nThe world is\nASSISTANT: full of possibilities.</s> [end of text]

Test 4:

# No newlines, no space on end
-p "USER: The world is ASSISTANT:"

Output = OK:
The world is a vast and complex place, filled with an infinite number of experiences, perspectives, cultures, and ideas. It is a place of wonder, mystery, and awe-inspiring beauty, as well as great challenges and hardships. The world is ...

I ran Test 4 several more times and all came out fine, so that's what I've updated the prompt template to.
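
Applied to the llama-cpp-python call from the original post, that would look something like this sketch (same parameters, only the prompt string changed):

# Corrected template from Test 4: no newlines, nothing after "ASSISTANT:"
for output in llm2("USER: The world is ASSISTANT:", max_tokens=500, stop=["</s>"], temperature=0.7, echo=True, stream=True):
    print(output["choices"][0]["text"], end="")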

Wow, it works! I didn't think to look at the template; apparently it's indeed very sensitive. Thanks so much for the quick response, Tom!! Going to put this model to the test now.

For some reason, I only get OK outputs with

# newlines in front, no space on end
"\nUSER: The world is\nASSISTANT:"

Output = OK:
The world is a complex and interconnected place, made up of many different cultures, languages, and people. It is characterized by its natural beauty, diverse landscapes, and rich history. The world is also ...
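
As a Python string, that variant would be (a sketch):

prompt = "\nUSER: The world is\nASSISTANT:"  # leading newline, newline before "ASSISTANT:", nothing after it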
