Can we get a draft model for this model please?

#2
by CYISNOTHERE - opened

I could probably get around to training a 100M-parameter Eagle 2 model in a few weeks. The problem is that I need a few hundred thousand sequences from the 31B model, which I can only run with offloading to system memory, so that might take a while. The E4B model should have out-of-the-box support as a draft model, though. It's just bigger than Eagle for the same performance, and you need enough memory to run an 8B model and a 31B model at the same time.

Ok, there is no support for drafting between those two, but an Eagle3 draft model has already been trained by someone else: https://huggingface.co/thoughtworks/Gemma-4-31B-Eagle3

TeichAI org

I don't think the tokenizer changed; any gemma4 model should be supported as a drafting model.

I was about to edit my reply but thought to refresh just in case, bruh

Yall were fast!

Thanks for the response.

Also the thinking blocks and use of markdown might be broken 👀

Basically, I haven't even used it for tool use yet, but I predict it won't work well with anything that uses tools.

TeichAI org

Also the thinking blocks and use of markdown might be broken 👀

Could you be a bit more specific here? Perhaps a screenshot of your broken output, as well as some info on how you're running your inference, would be helpful.

TeichAI org

"Yall were fast!"
When you get bored and keep refreshing a page, you catch things pretty quickly 🤣

TeichAI org

Wait for v2. I was using it in cline/continue flawlessly. It built me a web app, set up a local Supabase instance, and wired everything together, frontend and backend :)

Also the thinking blocks and use of markdown might be broken 👀

Could you be a bit more specific here? Perhaps a screenshot of your broken output, as well as some info on how you're running your inference, would be helpful.

image

The model also breaks by starting to repeat itself mid-generation once given a long enough task, and I traced this to flash attention being enabled. (I am unsure whether this is normal for local models, where some can use it and some can't, but this model cannot use flash attention, or at least not when some parts of the model are offloaded.)

TeichAI org

Seems like your agent runner (looks like LMStudio) doesn't support the Gemma4 thinking format? Could you provide a side-by-side with the Teich model & a regular Gemma 4 model?

TeichAI org

Oh, that's because it's not trained to emit a newline after closing the channel tag. So if you don't have reasoning parsing set up properly, markdown renderers won't know to start after the end of the <channel|> tag.

TeichAI org

The model also breaks by starting to repeat itself mid-generation once given a long enough task, and I traced this to flash attention being enabled. (I am unsure whether this is normal for local models, where some can use it and some can't, but this model cannot use flash attention, or at least not when some parts of the model are offloaded.)

I did see this flash attention issue as well, though. The v2 was working better with FA on, but it still trips up occasionally.

Seems like your agent runner (looks like LMStudio) doesn't support the Gemma4 thinking format? Could you provide a side-by-side with the Teich model & a regular Gemma 4 model?

Your model

image

Original Gemma 4 Model

image

Note: I have not been able to get Gemma 4 31B it to think in LM Studio. I've seen online that it has a reasoning capability, but I have yet to see the variant I downloaded think. The original variant I have downloaded is the LM Studio Community edition.

TeichAI org

Oh, that's because it's not trained to emit a newline after closing the channel tag. So if you don't have reasoning parsing set up properly, markdown renderers won't know to start after the end of the <channel|> tag.

I think @armand0e got this right.

The model also breaks by starting to repeat itself mid-generation once given a long enough task, and I traced this to flash attention being enabled. (I am unsure whether this is normal for local models, where some can use it and some can't, but this model cannot use flash attention, or at least not when some parts of the model are offloaded.)

I did see this flash attention issue as well, though. The v2 was working better with FA on, but it still trips up occasionally.

Mind you, I am offloading because I have a 5090 and 128GB of RAM.

TeichAI org

I don't see how offloading could cause this. You could try only using the CPU. But other than that I would recommend waiting for v2

I don't see how offloading could cause this. You could try only using the CPU. But other than that I would recommend waiting for v2

V2 it is, then. I'll keep an eye out 👍

TeichAI org

Seems resolved enough to close.

CompactAI changed discussion status to closed
TeichAI org

Only took 2 hours to solve this. That might be a record. 🤣

TeichAI org

So you are testing in LM Studio then, correct? If so, here is your fix:

  1. Head to the Models tab.
  2. Click the gear icon next to our model.
  3. Select the Inference tab, all the way to the right, and expand the Reasoning Parsing section.
  4. Change the Start String to <|channel>thought and the End String to <channel|>
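For anyone not on LM Studio, here is a rough sketch of what that reasoning-parsing setting amounts to: everything between the start string and the end string is treated as thinking, and the rest is handed to the markdown renderer. The function name `split_reasoning` is made up for illustration and is not an LM Studio or llama.cpp API; only the two marker strings come from the steps above.

```python
# Hypothetical helper for illustration; not part of LM Studio or llama.cpp.
START = "<|channel>thought"  # Start String from the steps above
END = "<channel|>"           # End String from the steps above


def split_reasoning(text, start=START, end=END):
    """Split raw model output into (reasoning, answer)."""
    s = text.find(start)
    if s == -1:
        # No thinking block at all: everything is the answer.
        return "", text.strip()
    e = text.find(end, s + len(start))
    if e == -1:
        # Unterminated thinking block: treat the remainder as reasoning.
        return text[s + len(start):].strip(), ""
    reasoning = text[s + len(start):e].strip()
    # Stitch together whatever came before and after the thinking block.
    answer = (text[:s] + text[e + len(end):]).strip()
    return reasoning, answer


# The answer starts right after the closing tag, with no newline in between.
raw = "<|channel>thought\nUser wants a list.<channel|># Title\n- item"
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Because the split happens on the marker strings rather than on line breaks, it does not matter that the model emits no newline after the closing tag.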

image

armand0e changed discussion status to open
TeichAI org
edited Apr 10

Let me know if it works. Personally, I think you may get past that first hurdle and just be met with other issues. I will be updating these GGUFs again momentarily with the latest llama.cpp Gemma 4 fixes.

image

TeichAI org

guessing it's the early-stopping/truncation issue with the old ggufs

updates going live now

TeichAI org

please try again with the latest ggufs, they are up and tested. Confirmed working on my end (via llama.cpp chat ui)

I don't think the tokenizer changed; any gemma4 model should be supported as a drafting model.

Well, the E models have per-layer embeddings, which the 31B does not have. They have only 131k context, while the 31B has 256k, and their tokenizer works with video and audio, whereas the 31B model only works with vision.

And the Eagle is more accurate anyway because it is trained on the hidden states of the model, whereas the E4B is just theoretically trained on the same data. And of course the E4B is actually 8B, so you have to fit an 8B and a 31B model in memory, versus a 31B model and a 350M model where the 350M is better. Eagle3 is not widely supported yet, though.
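To put rough numbers on that memory comparison, here is a back-of-the-envelope sketch. The parameter counts (31B, 8B, 350M) are the ones quoted above; the 4.5 bits per weight is purely an assumption standing in for a typical 4-bit-class quant, and KV cache, context length, and runtime overhead are ignored, so treat the output as a lower bound rather than a measurement.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate weight memory in GiB; ignores KV cache and runtime overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


# Assumption: average bits per weight for a 4-bit-class quant (not from the thread).
BITS = 4.5

target = weight_gib(31, BITS)  # the 31B target model
drafts = {
    "E4B draft (8B)": weight_gib(8, BITS),
    "Eagle draft (0.35B)": weight_gib(0.35, BITS),
}

for name, draft in drafts.items():
    total = target + draft
    print(f"{name}: target {target:.1f} GiB + draft {draft:.1f} GiB = {total:.1f} GiB")
```

In practice a small Eagle head may well be kept at higher precision than the target, which would raise its share a little, but at that size the draft stays a small fraction of the total either way.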
