People complaining about how hard it is to get a simple answer don't appreciate the complexity of figuring out the optimal model...
There are so many knobs to tweak that it's a non-trivial problem:
- Average/median length of your prompts
- prompt eval speed (tok/s)
- token generation speed (tok/s)
- Image/media encoding speed for vision tasks
- Total amount of RAM
- Max bandwidth of your RAM (DDR4, DDR5, etc.)
- Total amount of VRAM
- "-ngl" (number of layers offloaded to the GPU)
- Context size needed (you may need sub 16k for OCR tasks for instance)
- Model size in billions of parameters
- Number of active parameters for MoE models
- Acceptable level of perplexity for your use case(s)
- How aggressive a quantization you're willing to accept (while keeping perplexity low enough)
- Even finer-grained knobs: temperature, penalties, etc.
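Several of those knobs interact in a predictable way. As a rough sketch (the bits-per-weight figures and hardware numbers below are illustrative rules of thumb, not exact values for any specific model or machine), you can estimate the weight footprint from parameter count and quantization, and an upper bound on generation speed from memory bandwidth, since token generation is usually bandwidth-bound and only the *active* parameters count for MoE:

```python
def model_bytes_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters * bits per weight.

    Rough bits-per-weight rules of thumb: ~16 for FP16, ~8.5 for Q8_0,
    ~4.5 for Q4_K_M (quant formats carry some per-block overhead).
    Ignores KV cache, which grows with context size.
    """
    return params_b * bits_per_weight / 8  # billions of params * bits / 8 = GB


def decode_tps_upper_bound(active_params_b: float, bits_per_weight: float,
                           bandwidth_gbps: float) -> float:
    """Generating one token reads all (active) weights once, so
    tok/s is bounded by bandwidth / active-weight bytes."""
    return bandwidth_gbps / model_bytes_gb(active_params_b, bits_per_weight)


# Hypothetical example: 7B dense model at ~4.5 bpw on ~80 GB/s DDR5
print(round(model_bytes_gb(7, 4.5), 1))            # ~3.9 GB of weights
print(round(decode_tps_upper_bound(7, 4.5, 80), 1))  # ~20.3 tok/s ceiling
```

This is why a 30B MoE with 3B active parameters can generate faster than a 7B dense model on the same RAM, even though it needs far more total memory.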
Also, tok/s as a metric isn't enough on its own, because there's:
- thinking vs non-thinking: which mode do you need?
- Models that are much more "chatty" than others in the same class (I remember testing a few models that maxed out my modest desktop specs; Qwen 2.5 non-thinking was so much faster than the equivalent Ministral non-thinking even though they had equivalent tok/s, because Qwen would answer to the point quickly)
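The chattiness point above is easy to put in numbers: end-to-end response time is prompt-eval time plus generation time, so a model that emits four times as many output tokens is four times slower on the decode side even at identical tok/s. A minimal sketch (the speeds and token counts are made-up placeholders):

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    # Prefill and decode run at very different speeds, so they have to
    # be modeled separately; a single tok/s number hides output length.
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Same hypothetical hardware speeds, different "chattiness":
terse = response_time_s(1000, 150, prefill_tps=200, decode_tps=10)   # 20.0 s
chatty = response_time_s(1000, 600, prefill_tps=200, decode_tps=10)  # 65.0 s
print(terse, chatty)
```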
In the end, the final questions are: are you satisfied with how long it took to get an answer, and was the answer good enough?
The same exercise exists with paid APIs too. There are obviously fewer knobs, but depending on your use case there are still differences between providers and models. You can abstract away a lot of the knobs; just add "are you satisfied with how much it cost?" on top of the other two questions.
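For the cost question, the arithmetic is simple since providers typically price input and output tokens separately per million tokens. A quick sketch (the per-million-token prices here are hypothetical placeholders, not any real provider's rates):

```python
def api_cost_usd(prompt_tokens: int, output_tokens: int,
                 in_per_mtok: float, out_per_mtok: float) -> float:
    # Input and output tokens are billed at different per-Mtok rates;
    # chatty models cost more even when the per-token price is the same.
    return (prompt_tokens / 1e6) * in_per_mtok + (output_tokens / 1e6) * out_per_mtok


# Hypothetical: 1M input tokens at $0.50/Mtok, 200k output at $1.50/Mtok
print(round(api_cost_usd(1_000_000, 200_000, in_per_mtok=0.50, out_per_mtok=1.50), 2))
```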