Tiled Hacker news on React Router

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

152 points - today at 4:02 PM

vanyaland
today at 6:41 PM
For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.
shubhamintech
today at 7:50 PM
The MoE point matters here ie sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.
vicchenai
today at 5:34 PM
the practical question is whether the read pattern is sequential enough to actually saturate nvme bandwidth or if the attention layer access pattern ends up being random enough to kill throughput. sequential reads on a decent nvme get you 5-7 GB/s, random reads drop to maybe 500 MB/s depending on queue depth.
for a 1T model youd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential thats 300+ seconds per token which is... not great for interactive use but maybe fine for batch inference where you dont care about latency.
still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.
marksully
today at 4:42 PM
Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
baq
today at 5:09 PM
Intel Optane rolling in its grave.
Insanity
today at 4:51 PM
This is a pretty cool project! Essentially this is like using Swap memory to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.
I do wonder in practice how the 'smarts' pan out, because putting a ton of stress on your NVMe during generation is probably not the best choice for it's longevity.
zozbot234
today at 4:47 PM
It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.
root_axis
today at 5:50 PM
Are there any 1T parameter open source models?
nullbyte
today at 5:18 PM
I am curious how the TPS compares vs default OS virtual memory paging
speedgoose
today at 5:34 PM
I wonder how many minutes per token on GLM 5.
amelius
today at 5:32 PM
This is <1 tok/s for the 40GB model.
Come on, "Run" is not the right word. "Crawl" is.
Headlines like that are misleading.
monksy
today at 5:04 PM
There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance. (My understanding is better GPU/CPU splits, etc). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.
EnPissant
today at 5:28 PM
You do not provide any comparison to llama.cpp with mmap.
You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).

anshulbasia27

today at 5:25 PM

OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the 4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per token.

  What makes this approach faster is that the model's access pattern is completely deterministic during         
  inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
  you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal. 
  The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."

  For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,  
  then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
  expert 7. The neuron cache here is basically a domain-specific replacement policy.

zozbot234
today at 5:26 PM
> The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch.
man 2 madvise
EnPissant
today at 5:30 PM
That assumes you have significant work to do between fetches (so you can prefetch while using the current data). With LLM decode you don't.
today at 5:30 PM

Yanko_11
today at 6:01 PM
[dead]
anshulbasia27
today at 5:24 PM
[dead]
jee599
today at 6:26 PM
[dead]
tatef
today at 4:04 PM
[flagged]
erikcw
today at 5:37 PM
Simon Willison wrote a good post about Dan Woods’ work on “Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally”.
[0] https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

vanyaland

shubhamintech

vicchenai

p_ing

tatef

zozbot234

visarga

marksully

tatef

causal

baq

aitchnyu

liuliu

zozbot234

speedgoose

c0balt

walterbell

moffkalast

0ptan3

Insanity

zozbot234

tatef

embedding-shape

hrmtst93837

tatef

embedding-shape

Insanity

throwway120385

zozbot234

salynchnew

jeffybefffy519

root_axis

zozbot234

ai-inquisitor

root_axis

nullbyte

speedgoose

amelius

feznyng

maleldil

smlacy

monksy

zozbot234

rubiquity

circularfoyers

EnPissant

anshulbasia27

zozbot234

EnPissant

Yanko_11

anshulbasia27

jee599

tatef

password4321

tatef

DennisP

Retr0id

password4321

Forgeties79

Izikiel43

causal

frikk

tatef

WithinReason

Gracana

lostmsu

zozbot234

erikcw