Below are my test results after running local LLMs on two machines.
I'm using LM Studio for now because it's easy to use and makes logging and reviewing previous conversations simple. Later I plan to build my own custom local LLM system on the Mac Studio, probably orchestrated with LangChain and running models via llama.cpp.
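For reference, here is a minimal sketch of what that LangChain + llama.cpp setup might look like. It assumes llama-cpp-python and langchain-community are installed; the model path, context size, and GPU layer count are placeholders, not a tested configuration.

from langchain_community.llms import LlamaCpp

# Load a local GGUF model through llama.cpp and drive it from LangChain.
llm = LlamaCpp(
    model_path="models/gemma-3-27b-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # context window (placeholder)
    n_gpu_layers=-1,   # offload all layers to the GPU (Metal on Apple Silicon)
    temperature=0.7,
)

print(llm.invoke("Summarize the trade-offs of running LLMs locally."))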
My goal has always been to run models in ensembles to reduce individual model biases. The same principle was recently introduced as a feature called "model council" in Perplexity Max: https://www.perplexity.ai/hub/blog/introducing-model-council
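As a rough illustration of the ensemble idea, the sketch below sends the same prompt to several models behind LM Studio's OpenAI-compatible local server and has one model act as chair to reconcile the drafts. The model IDs are placeholders, and the port/api_key reflect LM Studio's defaults; this is not the final orchestration code.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

COUNCIL = ["gemma-3-27b", "gpt-oss-20b", "qwen3-14b"]  # placeholder model IDs

def ask(model: str, prompt: str) -> str:
    # Single chat completion against the local server.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def council_answer(question: str) -> str:
    # Collect one draft per council member, then let the first model reconcile them.
    drafts = [f"{m}: {ask(m, question)}" for m in COUNCIL]
    review_prompt = (
        "Several models answered the same question. Merge their answers into one "
        "response and note any disagreements.\n\n" + "\n\n".join(drafts)
    )
    return ask(COUNCIL[0], review_prompt)

print(council_answer("What are the trade-offs of 4-bit quantization?"))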
Chats will be stored in and recalled from a PostgreSQL database, using the pgvector extension for vector search and Apache AGE for graph queries.
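A hedged sketch of what that chat store might look like, using psycopg: the table name, columns, embedding dimension (768), and graph name are illustrative assumptions, not the final schema.

import psycopg

with psycopg.connect("dbname=chats user=postgres") as conn, conn.cursor() as cur:
    # Enable the extensions: pgvector for embeddings, Apache AGE for graph queries.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("CREATE EXTENSION IF NOT EXISTS age;")
    cur.execute("LOAD 'age';")
    cur.execute('SET search_path = ag_catalog, "$user", public;')

    # Chat messages with an embedding column for similarity search.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chat_messages (
            id         BIGSERIAL PRIMARY KEY,
            role       TEXT NOT NULL,
            content    TEXT NOT NULL,
            embedding  vector(768),          -- dimension depends on the embedding model
            created_at TIMESTAMPTZ DEFAULT now()
        );
    """)

    # Create the AGE graph once, if it doesn't exist yet.
    cur.execute("SELECT count(*) FROM ag_catalog.ag_graph WHERE name = 'chat_graph';")
    if cur.fetchone()[0] == 0:
        cur.execute("SELECT ag_catalog.create_graph('chat_graph');")

    conn.commit()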
For both sets of tests below, MLX builds were used when available, but they ultimately ran at almost the same speed as their GGUF counterparts.
I hope this information helps someone!
/////////
Mac Studio M3 Ultra (base configuration: 96 GB RAM, 1 TB SSD, 28-core CPU, 60-core GPU):
• Gemma 3 27B (Q4_K_M): ~30 tok/s, TTFT ~0.52 s
• GPT-OSS 20B: ~150 tok/s
• GPT-OSS 120B: ~23 tok/s, TTFT ~2.3 s
• Qwen3 14B (Q6_K): ~47 tok/s, TTFT ~0.35 s
(The GPT-OSS quantization details and the 20B TTFT figure are no longer available.)
//////////
MacBook Pro M1 Max 16.2" (64 GB RAM, 2 TB SSD, 10C CPU, 32C GPU):
• Gemma 3 1B (Q4_K): ~85.7 tok/s, TTFT ~0.39 s
• Gemma 3 27B (Q8_0): ~7.5 tok/s, TTFT ~3.11 s
• GPT-OSS 20B (8-bit): ~38.4 tok/s, TTFT ~21.15 s
• LFM2 1.2B: ~119.9 tok/s, TTFT ~0.57 s
• LFM2 2.6B (Q6_K): ~69.3 tok/s, TTFT ~0.14 s
• Olmo 3 32B Think: ~11.0 tok/s, TTFT ~22.12 s