Transformers
GGUF
mistral
text-generation-inference

These models that use RoPE behave very weirdly (dumb) with the default 32768 context setting in the latest Oobabooga text generation webui.

#6
by CR2022 - opened

Admin

What is your max context length in tokens?
Openchat_3.5-16k.Q5_K_M

The maximum token limit per response is currently 4096. Please feel free to ask any questions or topics I can help you with :)
Admin

In Oobabooga, your context length is set to 32768.
Openchat_3.5-16k.Q5_K_M

Thank you for the information, please let me know if there's anything else that I may assist. You have some strange things in my character creation.
Admin

I think the problem is related to the fact that you tell me your max context length is 4096 tokens, but your context length is actually set to 32768 tokens. These models use technologies like RoPE scaling or RoPE frequency. Does that mean anything to you?
Openchat_3.5-16k.Q5_K_M

The maximum token limit per response is actually 4096, but you can ask me for help on any questions or subjects!
Admin

Do you know what the ctx value is?
Openchat_3.5-16k.Q5_K_M

The current maximum token count is 2048 but your context length is set to 3276
Admin

No, it is set to 32768 tokens; that is also your ctx value.
Openchat_3.5-16k.Q5_K_M

I apologize for any confusion, please let me know if you have questions or topics where I may help you with :)
Admin

The problem is that I am trying to have a diagnostic conversation with you, but you do not understand me, and that is what I mean. I am trying to figure out together with you what the problem is. May I show you the terminal output debug information of your language model?
Openchat_3.5-16k.Q5_K_M

It seems like there's an error in my system. Please let me know if you have questions or subjects where I can assist you! :)
Admin

May I show you the diagnostic information of your language model file?
Openchat_3.5-16k.Q5_K_M

174 tokens is the maximum length of my response limit per message

Any idea how to solve it? Or is it just that these models using RoPE scaling become dumber the more you talk to them? I estimate that they start to become dumb and repetitive once they exceed the 4096 tokens they seem to believe they were created with. They are not using the whole 32K context at all, at least not in a clean install of the latest Oobabooga text generation webui with the default assistant template and the default model settings provided by the GGUF.

llm_load_print_meta: format = unknown
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.78 GiB (5.67 BPW)
llm_load_print_meta: general.name = nurtureai_openchat_3.5-16k
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.09 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4893.10 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 4096.00 MB
llama_new_context_with_model: compute buffer total size = 2141.88 MB
llama_new_context_with_model: VRAM scratch buffer: 2136.00 MB
llama_new_context_with_model: total VRAM used: 2136.00 MB (model: 0.00 MB, context: 2136.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
2023-11-11 15:51:08 INFO:Loaded the model in 6.74 seconds.
Output generated in 16.39 seconds (3.42 tokens/s, 56 tokens, context 72, seed 1017857071)
Llama.generate: prefix-match hit
Output generated in 18.82 seconds (3.77 tokens/s, 71 tokens, context 195, seed 249506680)
Llama.generate: prefix-match hit
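
For reference, the same context and RoPE values shown in the log above (n_ctx = 32768, freq_base = 1000000, freq_scale = 1) can also be set explicitly when loading the GGUF with llama-cpp-python, which makes it easier to rule out the webui defaults. This is only a minimal sketch; the file path, prompt, and n_gpu_layers are placeholders:

```python
from llama_cpp import Llama

# Load the GGUF with the context/RoPE values reported in the log above.
# The path is a placeholder; adjust n_gpu_layers for your hardware.
llm = Llama(
    model_path="openchat_3.5-16k.Q5_K_M.gguf",
    n_ctx=32768,               # matches n_ctx_train in the GGUF metadata
    rope_freq_base=1000000.0,  # matches freq_base_train
    rope_freq_scale=1.0,       # matches freq_scale_train
    n_gpu_layers=0,
)

print("context size:", llm.n_ctx())  # should report 32768

out = llm("Hello", max_tokens=32)
print(out["choices"][0]["text"])
```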

The problem is that you are expecting the model to know its context and where it’s loaded. The model has 0% idea about where it’s loaded and pretty much 0% idea about its context as well.

This stuff is usually stated in the system prompt. Even ChatGPT or GPT-4 would react the same way unless you told it in the system prompt.

This was just an example; the model fails to reason about pretty much anything once it exceeds 4096 tokens. It behaves normally in the beginning, but then it goes dumb and crazy.

Basically, you have a 4096-token model that uses RoPE to extend the context to supposedly 32K tokens, but it goes crazy or dumb when it exceeds around 4096 tokens.
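
As a rough illustration of what RoPE scaling does with numbers like these: linear scaling stretches a trained context by a factor, which is what the webui's compress_pos_emb setting (the inverse of llama.cpp's freq_scale) controls. The values below are illustrative only, not this GGUF's metadata (the log above actually reports freq_scale_train = 1):

```python
# Linear RoPE scaling arithmetic (illustrative values, not this model's metadata).
trained_ctx = 4096                           # context the base model was trained at
target_ctx = 32768                           # context you want to run at
compress_pos_emb = target_ctx / trained_ctx  # 8.0 -> webui "compress_pos_emb"
rope_freq_scale = 1 / compress_pos_emb       # 0.125 -> llama.cpp freq_scale
print(compress_pos_emb, rope_freq_scale)
```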

@CR2022 I think you don't understand how language models work.
First of all, any model, even GPT-4 or ChatGPT, has no idea where it's loaded or how much context it has, unless it was trained on many questions and answers about that.

You can kind of solve it by putting it in the system prompt, like "you are an AI loaded in text generation web UI and you have 32k context." But then it would need to know about text generation web UI, which is fairly new, and most models don't know about it.
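
A minimal sketch of what such a preamble could look like (the wording is made up for illustration; the model was not trained on this information, so it only helps as far as the model follows instructions):

```python
# Hypothetical system-style preamble to prepend before the first turn.
# The exact wording is an assumption, not anything this model was trained on.
SYSTEM_PREAMBLE = (
    "You are an AI assistant running locally in text-generation-webui. "
    "Your context window is 32768 tokens."
)
```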

Also, I'm not sure if you are using the correct prompt format.

It's not that it's dumb, but rather that you aren't asking it correctly.
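
On the prompt format: the OpenChat 3.5 model card documents a "GPT4 Correct User / GPT4 Correct Assistant" turn format terminated by <|end_of_turn|>, which matches the EOS token in the log above. A minimal sketch of sending one correctly formatted turn, reusing a Llama instance loaded as in the earlier snippet (double-check the exact template against the card for this 16k variant):

```python
# One user turn in the OpenChat 3.5 "GPT4 Correct" format.
# A system-style preamble (like the one sketched above) could be prepended
# before the first turn, but treat that as an assumption, not a documented feature.
prompt = (
    "GPT4 Correct User: What is your context length?<|end_of_turn|>"
    "GPT4 Correct Assistant:"
)

out = llm(prompt, max_tokens=128, stop=["<|end_of_turn|>"])
print(out["choices"][0]["text"])
```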
