Potential issue with model architecture (maybe an update is needed)

#1
by rboehme86 - opened

Hi @TheBloke
First off - thanks for the awesome work you do, I cannot emphasize this enough!

While extensively testing SauerkrautLM-Mixtral-8x7B-Instruct in your GGUF spin, I noticed that in roughly 50% of cases it goes completely off the rails, continuing to produce example output for several pages and only then (correctly, i.e. on a stop token) stopping. I searched around a lot and found that several people have observed this with Mixtral-8x7B, and that it was attributed to the sliding window attention setting as well as, in some cases, confusion about the stop sequence tokens. Long story short, all branches and maintainers (including @VAGOsolutions) updated their models' config.json, and Mixtral even went as far as releasing a v0.2 of the model. However, this is where the confusion starts, and I would ask for your quick judgement:

1.) I am not sure whether the config.json update is reason enough for you to do a re-run. If it is the cause of the issues, then a re-run is likely needed to make the model usable as GGUF (see the sketch below these questions for how I compared the configs).

2.) I am not sure whether the @VAGOsolutions model is based on 0.1 or 0.2 (which is said to contain some bugfixes, sadly without details). I'll reach out to them directly and maybe they can also do a refresh.
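
For context, here is roughly how I compared the configs - just a quick sketch using huggingface_hub, where the repo ids and the keys I look at (sliding_window, eos_token_id, etc.) are my assumptions about what is relevant rather than a definitive diagnosis:

```python
import json
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Repo ids assumed from the model names in this thread; the mistralai repo
# may require you to be logged in / have accepted the license.
REPOS = [
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "VAGOsolutions/SauerkrautLM-Mixtral-8x7B-Instruct",
]

# Keys that the various "config.json fix" discussions mention; this
# selection is my guess at what matters for the runaway-generation issue.
KEYS = ["sliding_window", "max_position_embeddings", "rope_theta",
        "bos_token_id", "eos_token_id"]

for repo in REPOS:
    path = hf_hub_download(repo_id=repo, filename="config.json")
    with open(path) as f:
        cfg = json.load(f)
    print(repo)
    for key in KEYS:
        print(f"  {key} = {cfg.get(key)}")
```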

What I also observed is that this model incorrectly gets classified as a llama model, even though llama.cpp supports Mixtral natively via GGUF - could this also be an issue?
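
In case it helps, this is how I double-checked the metadata in the GGUF file itself - a rough sketch based on the gguf Python package (the value decoding mimics what its gguf_dump script does, so treat it as illustrative only):

```python
from gguf import GGUFReader, GGUFValueType  # pip install gguf

reader = GGUFReader("sauerkrautlm-mixtral-8x7b-instruct.Q5_K_M.gguf")

# Print scalar and string metadata, e.g. general.architecture,
# llama.expert_count, tokenizer.ggml.eos_token_id; array fields
# (token lists, scores) are skipped to keep the output readable.
for name, field in reader.fields.items():
    if len(field.types) != 1:
        continue
    if field.types[0] == GGUFValueType.STRING:
        value = str(bytes(field.parts[-1]), encoding="utf-8")
    else:
        value = field.parts[-1][0]
    print(f"{name} = {value}")
```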

See my startup sequence below for pointers; any feedback is appreciated:

sauerkrautlm-mixtral-8x7b-instruct.Q5_K_M.gguf (version GGUF V3 (latest))
[2024-04-07 19:54:03] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[2024-04-07 19:54:03] llama_model_loader: - kv   0:                       general.architecture str              = llama
[2024-04-07 19:54:03] llama_model_loader: - kv   1:                               general.name str              = vagosolutions_sauerkrautlm-mixtral-8x...
[2024-04-07 19:54:03] llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
[2024-04-07 19:54:03] llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
[2024-04-07 19:54:03] llama_model_loader: - kv   4:                          llama.block_count u32              = 32
[2024-04-07 19:54:03] llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
[2024-04-07 19:54:03] llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
[2024-04-07 19:54:03] llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
[2024-04-07 19:54:03] llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
[2024-04-07 19:54:03] llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
[2024-04-07 19:54:03] llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
[2024-04-07 19:54:03] llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
[2024-04-07 19:54:03] llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
[2024-04-07 19:54:03] llama_model_loader: - kv  13:                          general.file_type u32              = 17
[2024-04-07 19:54:03] llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
[2024-04-07 19:54:03] llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
[2024-04-07 19:54:03] llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
[2024-04-07 19:54:03] llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
[2024-04-07 19:54:03] llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
[2024-04-07 19:54:03] llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
[2024-04-07 19:54:03] llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
[2024-04-07 19:54:03] llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
[2024-04-07 19:54:03] llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
[2024-04-07 19:54:03] llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
[2024-04-07 19:54:03] llama_model_loader: - kv  24:               general.quantization_version u32              = 2
[2024-04-07 19:54:03] llama_model_loader: - type  f32:   65 tensors
[2024-04-07 19:54:03] llama_model_loader: - type  f16:   32 tensors
[2024-04-07 19:54:03] llama_model_loader: - type q8_0:   64 tensors
[2024-04-07 19:54:03] llama_model_loader: - type q5_K:  833 tensors
[2024-04-07 19:54:03] llama_model_loader: - type q6_K:    1 tensors
[2024-04-07 19:54:03] llm_load_vocab: special tokens definition check successful ( 259/32000 ).
[2024-04-07 19:54:03] llm_load_print_meta: format           = GGUF V3 (latest)
[2024-04-07 19:54:03] llm_load_print_meta: arch             = llama
[2024-04-07 19:54:03] llm_load_print_meta: vocab type       = SPM
[2024-04-07 19:54:03] llm_load_print_meta: n_vocab          = 32000
[2024-04-07 19:54:03] llm_load_print_meta: n_merges         = 0
[2024-04-07 19:54:03] llm_load_print_meta: n_ctx_train      = 32768
[2024-04-07 19:54:03] llm_load_print_meta: n_embd           = 4096
[2024-04-07 19:54:03] llm_load_print_meta: n_head           = 32
[2024-04-07 19:54:03] llm_load_print_meta: n_head_kv        = 8
[2024-04-07 19:54:03] llm_load_print_meta: n_layer          = 32
[2024-04-07 19:54:03] llm_load_print_meta: n_rot            = 128
[2024-04-07 19:54:03] llm_load_print_meta: n_embd_head_k    = 128
[2024-04-07 19:54:03] llm_load_print_meta: n_embd_head_v    = 128
[2024-04-07 19:54:03] llm_load_print_meta: n_gqa            = 4
[2024-04-07 19:54:03] llm_load_print_meta: n_embd_k_gqa     = 1024
[2024-04-07 19:54:03] llm_load_print_meta: n_embd_v_gqa     = 1024
[2024-04-07 19:54:03] llm_load_print_meta: f_norm_eps       = 0.0e+00
[2024-04-07 19:54:03] llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
[2024-04-07 19:54:03] llm_load_print_meta: f_clamp_kqv      = 0.0e+00
[2024-04-07 19:54:03] llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[2024-04-07 19:54:03] llm_load_print_meta: n_ff             = 14336
[2024-04-07 19:54:03] llm_load_print_meta: n_expert         = 8
[2024-04-07 19:54:03] llm_load_print_meta: n_expert_used    = 2
[2024-04-07 19:54:03] llm_load_print_meta: rope scaling     = linear
[2024-04-07 19:54:03] llm_load_print_meta: freq_base_train  = 1000000.0
[2024-04-07 19:54:03] llm_load_print_meta: freq_scale_train = 1
[2024-04-07 19:54:03] llm_load_print_meta: n_yarn_orig_ctx  = 32768
[2024-04-07 19:54:03] llm_load_print_meta: rope_finetuned   = unknown
[2024-04-07 19:54:03] llm_load_print_meta: model type       = 7B
[2024-04-07 19:54:03] llm_load_print_meta: model ftype      = Q5_K - Medium
[2024-04-07 19:54:03] llm_load_print_meta: model params     = 46.70 B
[2024-04-07 19:54:03] llm_load_print_meta: model size       = 30.02 GiB (5.52 BPW) 
[2024-04-07 19:54:03] llm_load_print_meta: general.name     = vagosolutions_sauerkrautlm-mixtral-8x7b-instruct
[2024-04-07 19:54:03] llm_load_print_meta: BOS token        = 1 '<s>'
[2024-04-07 19:54:03] llm_load_print_meta: EOS token        = 2 '</s>'
[2024-04-07 19:54:03] llm_load_print_meta: UNK token        = 0 '<unk>'
[2024-04-07 19:54:03] llm_load_print_meta: PAD token        = 0 '<unk>'
[2024-04-07 19:54:03] llm_load_print_meta: LF token         = 13 '<0x0A>'
[2024-04-07 19:54:03] llm_load_tensors: ggml ctx size       =    0.38 MiB
[2024-04-07 19:54:03] llm_load_tensors: using CUDA for GPU acceleration
[2024-04-07 19:54:08] llm_load_tensors: system memory used  =   86.32 MiB
[2024-04-07 19:54:08] llm_load_tensors: VRAM used           = 30649.55 MiB
[2024-04-07 19:54:08] llm_load_tensors: offloading 32 repeating layers to GPU
[2024-04-07 19:54:08] llm_load_tensors: offloading non-repeating layers to GPU
[2024-04-07 19:54:08] llm_load_tensors: offloaded 33/33 layers to GPU
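
For completeness, the workaround I am currently experimenting with is passing explicit stop strings at inference time. The snippet below uses llama-cpp-python, and the prompt format plus stop strings are just my assumptions based on the Mixtral instruct template, not something taken from your README:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="sauerkrautlm-mixtral-8x7b-instruct.Q5_K_M.gguf",
    n_ctx=4096,        # the context size I happen to use, not a recommendation
    n_gpu_layers=33,   # matches the "offloaded 33/33 layers" line above
)

# Mixtral-instruct style prompt; the stop strings are my own guess at what
# reliably terminates the runaway "example output" continuations.
prompt = "[INST] Wer bist du? [/INST]"
out = llm(
    prompt,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)
print(out["choices"][0]["text"])
```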

Best

Robert
