refulgentis
today at 5:27 PM
It's super annoying when you have products that use these, because there have been... 4? releases in 3 weeks:
- Gemma 4 2B/4B/27BE3B/31B
- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)
- Gemma 4 12B (2 days ago? 1?)
- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)
It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.
Extremely glad for the output, not glad to have to chase it.
E.g. llama.cpp currently supports the originals but not the MTP predictors. There is a patch for the MTP predictors, but not for the small MoE models. I think it supports the 12B, but maybe not media for it yet. Now we have these too, and the blog says there are GGUFs (llama.cpp models), but there aren't any in the 12? repos I clicked through. And ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.
Also, if anyone at Google is taking feedback over to b/ or product: pleaseeee stop the "E"2B / "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or like a 4B (i.e. will run on most consumer hardware made since 2021).
These models aren't products? They're open-source-ish (open-weight, I guess) research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.
refulgentis
today at 6:10 PM
I understand it. :)
satvikpendem
today at 5:37 PM
Just use Unsloth Studio; it supports them all.