No MTP layers?

by soyalemujica - opened 2 days ago

I'm trying to quantize this into a GGUF model, but I think it does not have MTP layers? I'm using the normal convert_to_gguf.py script from llama.cpp with no special arguments, just the usual steps: generate the GGUF model in F16, then quantize to Q6K and Q5KM. However, trying to pass MTP does not work.
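One way to check the "no MTP layers" suspicion before converting is to list the tensor names in the checkpoint's safetensors index. This is only a minimal sketch: the helper names are hypothetical, and the "mtp" substring is an assumption about how such tensors would be named, so the marker may need adjusting for this model.

```python
import json
from pathlib import Path


def mtp_tensor_names(weight_map, marker="mtp"):
    """Return the sorted tensor names that contain `marker` (case-insensitive).

    The "mtp" marker is an assumption; the real naming depends on the model.
    """
    return sorted(name for name in weight_map if marker in name.lower())


def load_weight_map(index_path):
    """Read the "weight_map" dict (tensor name -> shard file) from a
    model.safetensors.index.json file of a sharded checkpoint."""
    return json.loads(Path(index_path).read_text())["weight_map"]


if __name__ == "__main__":
    # Hypothetical local path to the downloaded checkpoint's index file.
    index_file = Path("model.safetensors.index.json")
    if index_file.exists():
        names = mtp_tensor_names(load_weight_map(index_file))
        print(f"{len(names)} tensor(s) matching the marker")
        for name in names:
            print(name)
```

If the list comes back empty, the source checkpoint likely ships without those tensors (or names them differently), and no conversion option could add them.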

pearsonkyle

2 days ago

https://huggingface.co/pearsonkyle/tmax-27b-imatrix-MTP-GGUF

soyalemujica

2 days ago

That quantization is too low for my needs; I need Q5KM or Q6K. Q4KM always gets stuck in thinking loops.

pearsonkyle

2 days ago

edited 2 days ago

I can make you one with the MTP layers and upload it once it's done, in about an hour. Also, I think you'd be surprised by the imatrix quants (probably the iq4xs) in my repo. They are specifically calibrated on usage logs from claude code, open code and qwen code, with a setting enabled to help parse special tokens such as the ones at the beginning of chat templates for tool calls. Each quant was used to solve 10 different repo issues in the style of SWE-bench, but using the nebius/rebench dataset. None appeared to get stuck in loops or hit the maximum number of turns. That said, sampling with a repetition penalty > 1 can sometimes help keep these qwen models from looping.
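For anyone wondering what a repetition penalty > 1 does, here is a minimal sketch of the commonly used rule: logits of tokens that have already appeared are divided by the penalty when positive and multiplied by it when negative, so repeated tokens become less likely. The function name is hypothetical and this is an illustration, not the exact code of any particular runtime.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.1):
    """Return a copy of `logits` with already-seen token ids made less likely.

    logits: list of floats, one per vocabulary id.
    previous_tokens: token ids generated (or prompted) so far.
    penalty: values > 1 discourage repeats; 1.0 leaves logits unchanged.
    """
    adjusted = list(logits)
    for token_id in set(previous_tokens):
        if adjusted[token_id] > 0:
            # Shrink positive scores toward zero.
            adjusted[token_id] /= penalty
        else:
            # Push zero or negative scores further down.
            adjusted[token_id] *= penalty
    return adjusted


if __name__ == "__main__":
    # Token ids 0 and 1 were already produced; id 2 is untouched.
    print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
```

A penalty only slightly above 1 is the usual starting point, since large values can also suppress tokens that legitimately need to repeat, such as code identifiers.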