https://huggingface.co/ewald1976/oracle-omega-24b

#2467
by ewald1976 - opened

NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Sadly, this is broken for some reason. Did you use the correct tokenizer?

Hello,
The weights should be healthy, I already converted and quantized the model myself via gguf-my-repo without issues, so this isn't a broken-merge situation. The problem might be the tokenizer hash. The model uses the standard Mistral Small 3.2 Tekken tokenizer, but the tokenizer.json is a transformers-converted Tekken file, so its chkhsh isn't in llama.cpp's known list. I tried to swap in a canonical tokenizer.json, but the official mistralai/Mistral-Small-3.2-24B-Instruct-2506 repo only ships tekken.json — there's no upstream tokenizer.json with a registered hash to copy in.

Given that, would you be willing to map the pre-tokenizer to "tekken" for this model so the conversion can complete? If there's anything I can do on my side to make it easier — adjust files in the repo, provide additional info — just let me know.
No urgency whatsoever, and thank you again for all the work you share with everyone.

If there is still a problem, just close the request and I will try to remerge.

Thank you very much.

Hi, switching to a known tokenizer would definitely help, as that is the step that breaks. Is it possible to do that in the repo? We only work with mainline llama.cpp, so I'm not able to adjust anything on my side.

Done — I've replaced tokenizer.json and tokenizer_config.json with the ones from unsloth/Mistral-Small-3.2-24B-Instruct-2506. Both committed cleanly (the tokenizer.json has a different hash than before), so mainline llama.cpp should recognize it now. Could you give it another try when you have a moment? Thank you!

Queued with high priority while we are both here, just to see whether it works or not =)

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#oracle-omega-24b-GGUF for quants to appear.

Thank you so much!

Is there a problem with newer MergeKit versions? Because there have been numerous failures for Mistral 24B based models recently.

Perhaps mergekit saves a different tokenizer. Someone else made a similar request (I don't remember exactly, possibly also a mergekit Mistral merge); they had to replace the tokenizer with a known one, and it might have worked. So perhaps all that is needed is for mergekit to fix how it saves the tokenizer.

@Naphula you are the expert for merging. Does @RichardErkhov 's helpful suggestion make sense? I remember you have been experimenting with post-merge tokenisation healing.

This merge

https://huggingface.co/ewald1976/oracle-omega-24b

And this merge

https://huggingface.co/ShyliaSafetensors/EnceladusHyperStock-24B

Are missing the tokenizer lines entirely from the YAML.

When no tokenizer_source is specified, the expected behavior is to use the base model's tokenizer.

If the base model doesn't have a correct tokenizer, this may cause issues.

I suggest re-merging with

   tokenizer:
     source: union

or with

   tokenizer:
     source: (one of the models here)

You may also want to consider adding chat_template: auto to the yaml.
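A quick way to spot merge configs that are missing these lines is to scan the YAML for top-level keys. The sketch below is only an illustration: the config text and model names are made up, and the helper names (top_level_keys, missing_tokenizer_settings) are hypothetical, not part of mergekit. It uses a plain regex instead of a YAML parser, so it only sees keys that start at column 0.

```python
import re

# Hypothetical merge config, for illustration only (model names are made up).
EXAMPLE_YAML = """\
merge_method: model_stock
base_model: example/base-24b
models:
  - model: example/finetune-a
  - model: example/finetune-b
dtype: bfloat16
"""

# Matches a key that starts at column 0, e.g. "tokenizer:" or "dtype: bfloat16".
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def top_level_keys(yaml_text):
    """Return the set of keys that start at column 0 of the YAML text."""
    keys = set()
    for line in yaml_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_tokenizer_settings(yaml_text):
    """List which of the suggested settings are absent from the config."""
    keys = top_level_keys(yaml_text)
    missing = []
    # Either the newer "tokenizer:" block or the older "tokenizer_source:" key counts.
    if not ({"tokenizer", "tokenizer_source"} & keys):
        missing.append("tokenizer")
    if "chat_template" not in keys:
        missing.append("chat_template")
    return missing


if __name__ == "__main__":
    print(missing_tokenizer_settings(EXAMPLE_YAML))
```

Appending the tokenizer block and chat_template: auto to the example text makes the returned list empty.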

Both Oracle Omega and Enceladus Hyper Stock are merging 2501 with 2506 models. In these cases I've always used tokenizer source union with chat_template auto. Like here

https://huggingface.co/Naphula/Slimaki-24B-v1.2

The fact Morax v2 is also having issues quantizing might be another issue altogether.

If all else fails you (or other quantizers) may have to fiddle with other versions of python libraries as seen here https://huggingface.co/datasets/Naphula-Archives/master_python_list_mergekit_windows

Newer versions of llama.cpp might be broken, or possibly mismatched python versions on @RichardErkhov 's end.

I'm not sure why Morax is having trouble so will likely try to upload quants for this myself.

Most of my latest experiments have been with 12B, where I have to use a passthrough merge just to fix broken ChatML tokenizers with Mistral Tekken. This is mostly useful for merging base with instruct, since I noticed you can get much more creativity out of base finetunes (but they are stupid, so they need enough instruct merged in; for this I like arcee_fusion, and I am still refining the new Raven 12B).

The method is described here https://huggingface.co/OccultAI/Qliphoth-12B-v1.2 but might only be applicable to 12B models since I don't know of too many ChatML 24Bs.

In my tests, simply swapping tokenizer jsons from another model usually does not work. But it's worth a shot because you will get instant errors trying to quantize if it doesn't.

LLM advice below (not sure how accurate)

Default Behavior When No Tokenizer Source is Specified

When no tokenizer or tokenizer_source field is specified in your YAML configuration, MergeKit uses a fallback mechanism to handle tokenizers:

  1. No tokenizer task is created: In MergePlanner.__init__(), the BuildTokenizer task is only instantiated if tokenizer_source is not None [1].

  2. Fallback to copying tokenizer: During merge execution in run_merge(), if no tokenizer was built by the task system, it attempts to copy the tokenizer from a donor model when options.copy_tokenizer is True (which is the default) [2].

  3. Copy method: The _copy_tokenizer() function either directly copies tokenizer files (tokenizer.json, tokenizer_config.json, etc.) from the donor model, or falls back to loading and reserializing the tokenizer using transformers.AutoTokenizer [3].
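The three steps above can be sketched as a small decision function. This is not MergeKit's code: the function name, its parameters and the returned strings are hypothetical, and the logic only mirrors the description given here, which itself has not been verified against the MergeKit source.

```python
def plan_tokenizer_handling(tokenizer_source=None,
                            copy_tokenizer=True,
                            donor_has_tokenizer_files=True):
    """Return a short description of what would happen to the tokenizer.

    Hypothetical helper that follows the three steps described above.
    """
    # Step 1: a tokenizer is only built when a source is configured.
    if tokenizer_source is not None:
        return f"build tokenizer from source '{tokenizer_source}'"
    # Step 2: otherwise fall back to copying, unless copying is disabled.
    if not copy_tokenizer:
        return "no tokenizer written"
    # Step 3: copy files directly if the donor has them, else re-serialize.
    if donor_has_tokenizer_files:
        return "copy tokenizer files from donor model"
    return "load donor tokenizer and re-serialize it"


if __name__ == "__main__":
    print(plan_tokenizer_handling())
    print(plan_tokenizer_handling(tokenizer_source="union"))
    print(plan_tokenizer_handling(donor_has_tokenizer_files=False))
```

If this description is right, the last branch (re-serializing through transformers) is the path on which a differently shaped tokenizer.json could be produced.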

Debugging the Mistral 24B Tokenizer Errors

The error NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre() is not from MergeKit but from llama.cpp during GGUF conversion. This indicates the issue is with the tokenizer format that MergeKit saves, not the merge itself.

Root Cause Analysis

Based on the discussion and code:

  1. Tekken tokenizer format: Mistral Small 3.2 uses the Tekken tokenizer, but when MergeKit reserializes it via transformers.AutoTokenizer.from_pretrained() and save_pretrained(), it may produce a tokenizer.json format that llama.cpp doesn't recognize [4].

  2. Hash mismatch: The transformers-converted Tekken tokenizer.json has a different hash than what llama.cpp expects, causing the conversion to fail.

  3. Missing canonical tokenizer: The official Mistral repo only ships tekken.json, not a tokenizer.json with a registered hash that llama.cpp recognizes.
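To make the hash mismatch concrete, here is a minimal sketch of how such a check can work. It assumes, based on my reading of llama.cpp's convert_hf_to_gguf.py, that chkhsh is a SHA-256 over the token IDs produced for a fixed check string rather than over the bytes of tokenizer.json. The toy encoders, the short check text and the KNOWN table are stand-ins invented for this example; the real script uses the actual tokenizer and a much longer check string.

```python
from hashlib import sha256


def compute_chkhsh(encode, check_text):
    """Hash the token IDs that `encode` produces for `check_text`."""
    token_ids = encode(check_text)
    return sha256(str(token_ids).encode()).hexdigest()


# Toy encoders standing in for two tokenizers that split text differently.
def split_on_spaces(text):
    return [len(piece) for piece in text.split(" ")]


def split_on_chars(text):
    return [ord(char) for char in text]


def lookup_pre_tokenizer(encode, check_text, known):
    """Map a hash to a pre-tokenizer name, failing the way the converter does."""
    chkhsh = compute_chkhsh(encode, check_text)
    if chkhsh not in known:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
    return known[chkhsh]


if __name__ == "__main__":
    check_text = "Hello world 123"  # short stand-in for the real check string
    # Hypothetical registry: only the space-splitting encoder is "known".
    KNOWN = {compute_chkhsh(split_on_spaces, check_text): "tekken"}
    print(lookup_pre_tokenizer(split_on_spaces, check_text, KNOWN))
    try:
        lookup_pre_tokenizer(split_on_chars, check_text, KNOWN)
    except NotImplementedError as err:
        print("unregistered tokenizer:", err)
```

Under that assumption, what matters is how the tokenizer splits the check string, so a tokenizer.json with a different file checksum does not by itself guarantee that the converter will recognize it.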

Potential Solutions

  1. Explicit tokenizer configuration: Specify a known-good tokenizer in your YAML:

    tokenizer:
      source: "base"  # or a specific model with known-good tokenizer
    
  2. Post-merge tokenizer replacement: As discussed in the thread, manually replace the tokenizer.json and tokenizer_config.json with versions from a known-good source like unsloth/Mistral-Small-3.2-24B-Instruct-2506 after the merge.

  3. Use mergekit-tokensurgeon: For more control over tokenizer transplantation, consider using the specialized mergekit-tokensurgeon tool which provides approximation methods for handling tokenizer differences [5].
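Solution 2 (post-merge replacement) can be done locally before uploading. The sketch below assumes both the donor model and the merged model are already downloaded to local folders; the function name and the .bak backup convention are choices made for this example, not something from the thread or from MergeKit.

```python
import shutil
from pathlib import Path

# The two files that were swapped in this thread.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def replace_tokenizer_files(donor_dir, merged_dir, names=TOKENIZER_FILES):
    """Copy tokenizer files from donor_dir into merged_dir, keeping backups."""
    donor_dir, merged_dir = Path(donor_dir), Path(merged_dir)
    # Refuse to touch anything unless the donor has every requested file.
    missing = [name for name in names if not (donor_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"donor is missing: {missing}")
    replaced = []
    for name in names:
        target = merged_dir / name
        if target.exists():
            # Keep the previous file as e.g. tokenizer.json.bak.
            shutil.copy2(target, target.with_name(target.name + ".bak"))
        shutil.copy2(donor_dir / name, target)
        replaced.append(name)
    return replaced
```

Usage would look like replace_tokenizer_files("path/to/donor", "path/to/merged") with your own paths. The .bak files are only a local safety net and should be removed before pushing the repo.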

The issue appears to be a compatibility problem between MergeKit's tokenizer serialization and llama.cpp's expectations, rather than a bug in MergeKit's merging logic itself.

Notes

The test suite confirms that when no tokenizer_source is specified, the expected behavior is to use the base model's tokenizer [6]. The errors you're seeing are specific to downstream tools (llama.cpp) not recognizing the tokenizer format, not MergeKit's merge functionality.

As you can see with @Vortex5 merges, a common procedure is to select the most stable model you have for the tokenizer. This usually works well if union causes issues. This is what I did with the Qliphoth merges in order to fix the endless generation missing EOS token.

> Newer versions of llama.cpp might be broken, or possibly mismatched python versions on @RichardErkhov 's end.

We are using a llama.cpp build that is at most a week old, and other models (the ones that are supposed to work) are working. The exception is merged Mistrals, which (if I remember correctly; my memory is currently in the state of RAM that has no power) all fail with an unrecognized tokenizer. Possibly just replace the tokenizer with the original Mistral tokenizer? That could hopefully help, but it needs to be done on the model's end, not mine.

And as far as I noticed, it is only in the past few days that so many Mistrals have failed. So you might be right about a broken llama.cpp, or maybe my memory is broken 🤔

Maybe I can upload a few tests (tokenizer variations) that you can try to quantise, to narrow down the root cause.

Did you have problems with 12B Nemo as well, not just 24B?

It looks like you were able to quantize Qliphoth v1.2 (Nemo) recently, so this may narrow it down a bit to just 24B.

You could also try rolling back to this release:

https://github.com/ggml-org/llama.cpp/releases/tag/b9080

This is the one I downloaded for Gemma 4 quantization, and it's also stable for quantizing Mistral 24B.

You may have to eventually have separate 'branches' for quantizing different archs, like how I now have to use an entire python library hotswap for merging either G4 or older mistrals. I think even with runpod now you have to use one or the other

Thanks, that makes sense. I did not explicitly set tokenizer handling in the original merge YAML, so the tokenizer fallback may have produced a bad Tekken/tokenizer.json combination. I’ll re-merge with tokenizer: source: union and chat_template: auto, or alternatively pin the tokenizer to a stable 2506 source model, then re-submit. Appreciate the detailed explanation.
