
BOS token as 1 seriously hurts these GGUF Yi models

#1
by KerfuffleV2 - opened

They weren't trained for it. I have a pull open to try to resolve the issue: https://github.com/ggerganov/llama.cpp/pull/4040

Note there's a part in there that affects conversion too, since there was a bug which could prevent the add_bos_token/add_eos_token booleans from getting added to the metadata if tokenizer.json didn't exist.

It's possible to fix models after the fact using the gguf-set-metadata.py utility in the llama.cpp repo. You can try setting the BOS token id to the same value as EOS (2), but setting it to 144 (Yi's newline token) seems to work better. Example, assuming you're in the llama.cpp repo:

gguf-py/scripts/gguf-set-metadata.py some-yi-model.gguf tokenizer.ggml.bos_token_id 144

The tool will report the original value, so if for whatever reason you prefer the original behavior you can just set it back. (I believe it's just 1 for all these models.)
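If you'd rather script the check, here's a minimal sketch using the gguf Python package from llama.cpp's gguf-py directory. The field access mirrors what gguf-set-metadata.py does; treat it as illustrative rather than the official API, since details may change between versions:

# Illustrative sketch: read the BOS token id from a GGUF file.
# Assumes the gguf package from llama.cpp's gguf-py is installed
# (e.g. pip install gguf, or run from inside the repo).
import sys
from gguf import GGUFReader

def get_bos_token_id(path):
    reader = GGUFReader(path, 'r')
    field = reader.get_field('tokenizer.ggml.bos_token_id')
    if field is None:
        return None
    # data[0] indexes the part of the field that holds the actual value.
    return int(field.parts[field.data[0]][0])

if __name__ == '__main__':
    print('BOS token id:', get_bos_token_id(sys.argv[1]))

Running it before and after patching lets you confirm the change stuck: you should see 1 before and 144 after.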

This is something you want to do for all GGUF Yi models currently available: this one, the Dolphin one, and the originals.

@TheBloke just wanted to ping you and see what you think about this... is that something you could try?

Yes, doing it now

Thanks so much for the script and the instructions, @KerfuffleV2 !

I've done it for this model and the updated GGUFs will start uploading shortly. I will do Dolphin and the originals next.

An example Q4_K_M generation

USER: write a story about llamas ASSISTANT: Once upon a time, in the heart of the Andes Mountains, there was a small village named Llama-land. This village was known for its beautiful scenery and its love for animals, especially llamas. The people of Llama-land had domesticated these gentle creatures centuries ago, using them for transportation, agriculture, and even as companions.

In the center of the village, there was a large llama pasture where hundreds of llamas grazed peacefully during the day. They were cared for by expert herders who understood their behaviors and needs perfectly. The villagers believed that these animals brought good luck and prosperity to their community.

Among all the llamas in Llama-land, there was one named Lucky. He was a young male with beautiful brown fur, bright eyes, and an extremely friendly personality. Everyone who met him fell in love instantly, including a little girl named Mariana. She visited the llama pasture every day after school to spend time with her favorite friend, Lucky.

One sunny afternoon, while Mariana was playing with Lucky, she noticed something strange on his back - a small white patch shaped like a heart. The villagers had never seen anything like it before and soon word spread about this unusual marking on the beloved llama's fur. They all gathered around to see this amazing sight for themselves, marveling at the beauty of nature.

As days turned into weeks, the heart-shaped mark became more visible, and people from faraway places started visiting Llama-land just to meet Lucky. The villagers were proud of their special llama and took great care of him, ensuring he remained happy and healthy.

One day, a group of researchers arrived in Llama-land hoping to study the unique heart-shaped marking on Lucky's back. They believed that this rare occurrence could provide valuable insights into the genetics and behavior of llamas. The villagers welcomed them with open arms, eager to share their knowledge and love for these amazing animals.

After months of studying Lucky and his genetic makeup, the researchers discovered something incredible – the heart-shaped marking was not just a coincidence but rather a result of specific gene combinations that occurred very rarely among llamas. They also found that Lucky had unique personality traits compared to other llamas, making him even more special.

The news about Lucky's scientific significance spread across the globe, and Llama-land became a popular destination for tourists who wanted to meet the famous heart-shaped llama. The villagers took advantage of this opportunity by starting businesses related to tourism, such as guided tours, souvenir shops, and traditional food stalls.

Despite all the attention, Lucky remained humble and true to his nature. He continued spending his days grazing peacefully alongside Mariana, who never forgot how special their friendship was. The people of Llama-land cherished their bond with these amazing creatures even more than before, knowing that their love for llamas had made a significant impact on the world.

And so, the story of Lucky, the heart-shaped llama, lived on in the hearts and minds of everyone who visited Llama-land. His unique marking served as a reminder of the wonders of nature and the deep connection between humans and animals, inspiring people to appreciate and protect these incredible creatures for generations to come.</s> [end of text]

Updated GGUFs are now uploaded

Unfortunately there is another issue with the GGUF - not related to quality, but to CUDA GPU acceleration:

$ CUDA_VISIBLE_DEVICES=6 ./main -m /workspace/process/nousresearch_nous-capybara-34b/gguf/nous-capybara-34b.Q4_K_M.gguf -c 4096 -p "USER: write a story about llamas ASSISTANT:" -ngl 100
...
llm_load_print_meta: BOS token = 144 '
'
llm_load_print_meta: EOS token = 2 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 315 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.20 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =  246.29 MB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: VRAM used: 19454.15 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 960.00 MB
llama_new_context_with_model: kv self size  =  960.00 MB
llama_build_graph: non-view tensors processed: 1384/1384
llama_new_context_with_model: compute buffer total size = 499.57 MB
llama_new_context_with_model: VRAM scratch buffer: 498.00 MB
llama_new_context_with_model: total VRAM used: 20912.15 MB (model: 19454.15 MB, context: 1458.00 MB)

system_info: n_threads = 56 / 112 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0



USER: write a story about llamas ASSISTANT:<h3><h3>
CUDA error 716 at ggml-cuda.cu:7104: misaligned address
current device: 0

This isn't related to the BOS update, the same problem occurs with the original GGUF as well.

@KerfuffleV2 do you know if this is expected atm with Llamafied Yi? If not I will raise it.

Just chiming in: I've been using this model (after patching it myself) with KoboldCpp, and there's no CUDA issue there when offloading to GPU.

And, wow, this model has been doing exceptionally well in my preliminary tests!

@TheBloke

> do you know if this is expected atm with Llamafied Yi?

I'm not completely sure what you mean; are you asking about the CUDA error? I think your issue here is probably because of using multi-GPU, but I'm not sure. The Yi-based models work fine for me with ROCm.

If you mean the llamafied model specifically, as far as I know the only thing they did was rename two tensors to match the normal LLaMA convention. However, GGUF conversion normalizes the names anyway. In other words, there shouldn't be any difference between the original "Yi" and "Yi-llamafied", so you shouldn't need to provide separate GGUF quants for that. (Guess this could be an opportunity to test out that gguf-checksum script I mentioned; see the sketch below.)
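In case it's useful, here's a rough sketch of what that kind of comparison could look like. This is not the actual gguf-checksum script, just an illustration assuming the gguf package's GGUFReader; tensor access details may differ between versions:

# Illustrative sketch: compare the tensor contents of two GGUF files by hashing
# each tensor's raw data. Metadata differences are deliberately ignored here.
import hashlib
import sys
from gguf import GGUFReader

def tensor_hashes(path):
    reader = GGUFReader(path, 'r')
    # Each reader tensor exposes .name and .data (a numpy view of the raw bytes).
    return {t.name: hashlib.sha256(t.data.tobytes()).hexdigest() for t in reader.tensors}

if __name__ == '__main__':
    a = tensor_hashes(sys.argv[1])
    b = tensor_hashes(sys.argv[2])
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            print('differs:', name)
    print('tensor data identical' if a == b else 'tensor data differs')

If the llamafied and original conversions really only differ in tensor names that the converter normalizes anyway, the tensor hashes from the two quants should come out identical.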


> And, wow, this model has been doing exceptionally well in my preliminary tests!

I was very impressed too. My own dumb little test is to make the LLM write a story and see how long it takes before it says something absurd. None of the other Yi models (including the Dolphin one) do too well. This model actually approaches 70B performance, though.

Yes, I was asking about the CUDA error. I've confirmed it on two separate Linux systems with llama.cpp, testing two Yi Llamafied 34B fine-tuned models (this one, and Dolphin 2.2).

One system has 8 x A6000s, but I'm limiting it to a single GPU using CUDA_VISIBLE_DEVICES. The other has a single H100. Both are on CUDA 11.8.

Every generation using -ngl X fails with the error shown above on those two models. I've not tested other Yi models yet, but other models (Llama 2 13B, Mistral 7B, etc.) work fine on the same systems.

I'll report on llama.cpp.

> I'll report on llama.cpp.

That's probably the best thing to do. Sorry, I'm not aware of a non-multi-GPU problem with these. All I can say is that offloading a few layers works for me on ROCm. I only have an 8GB GPU, so I can't do full offloading.

I also don't think there should be a difference for these compared to the original models in terms of CUDA stuff. I don't know all that much about it though.

Closing this now since you resolved the issue. Thanks!

KerfuffleV2 changed discussion status to closed

There's an issue for it already, apparently: https://github.com/ggerganov/llama.cpp/issues/4075

I added your information there. Edit: it seems to be crashing on a different line than that one, but both are in mul_mat code, so it's probably still related.

Thanks very much Kerfuffle! The issue has now been resolved by slaren in https://github.com/ggerganov/llama.cpp/pull/4084
