
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

152 points - today at 4:02 PM

Source
  • vanyaland

    today at 6:41 PM

    For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.

    • shubhamintech

      today at 7:50 PM

      The MoE point matters here, i.e. sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.

      • vicchenai

        today at 5:34 PM

        the practical question is whether the read pattern is sequential enough to actually saturate nvme bandwidth or if the attention layer access pattern ends up being random enough to kill throughput. sequential reads on a decent nvme get you 5-7 GB/s, random reads drop to maybe 500 MB/s depending on queue depth.

        for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential that's 300+ seconds per token which is... not great for interactive use but maybe fine for batch inference where you don't care about latency.

        still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.
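The arithmetic above can be checked with a quick back-of-envelope script (the bandwidth figures are the assumptions from this thread, not measurements):

```python
# Streaming every weight of a dense model from NVMe once per generated token.
# Bandwidth numbers are the thread's assumptions, not benchmarks.

def seconds_per_token(params: float, bytes_per_param: float, read_gbps: float) -> float:
    """Time to stream all weights once at a given NVMe read bandwidth."""
    total_bytes = params * bytes_per_param
    return total_bytes / (read_gbps * 1e9)

# 1T dense params at fp16 (2 bytes each) = 2 TB per forward pass.
sequential = seconds_per_token(1e12, 2, 6.0)   # ~6 GB/s sequential reads
random_io  = seconds_per_token(1e12, 2, 0.5)   # ~500 MB/s random reads

print(f"sequential: {sequential:.0f} s/token")  # 333 s/token
print(f"random:     {random_io:.0f} s/token")   # 4000 s/token
```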

          • p_ing

            today at 6:27 PM

            4K random read with a queue depth of 1 on an M1 Max is about 65MB/s.

            • tatef

              today at 6:25 PM

              Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful and capable of achieving multiple tokens/sec.

              • zozbot234

                today at 5:46 PM

                > for a 1T model youd need to stream something like 2TB of weights per forward pass

                Isn't this missing the point of MoE models completely? MoE inference is sparse; you only read a small fraction of the weights per layer. You still have the problem of each individual expert-layer being quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.
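As a rough sketch of that point (the total and active parameter counts below are illustrative assumptions, not numbers from the repo or any specific model):

```python
# Why sparse activation changes the arithmetic: only the routed experts'
# weights are read per token, not the whole model. Numbers are illustrative.

def moe_bytes_per_token(total_params: float, active_params: float,
                        bytes_per_param: float = 2) -> tuple:
    """Bytes read per token for dense vs sparse (MoE) activation at fp16."""
    return total_params * bytes_per_param, active_params * bytes_per_param

# Hypothetical 1T-total / 32B-active MoE model:
dense_bytes, sparse_bytes = moe_bytes_per_token(1e12, 32e9)
print(f"dense:  {dense_bytes/1e12:.1f} TB/token")   # 2.0 TB/token
print(f"sparse: {sparse_bytes/1e9:.0f} GB/token")   # 64 GB/token
# At ~6 GB/s that is still ~11 s/token, and the reads are scattered across
# many small expert tensors rather than one long sequential stream.
```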

                  • visarga

                    today at 5:52 PM

                    But across a sequence you still have to load most of them.

            • marksully

              today at 4:42 PM

              Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.

                • tatef

                  today at 6:27 PM

                  I'm referencing it as being possible; however, I didn't share benchmarks because candidly the performance would be so slow it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e. smaller MoE models where not all experts need to be loaded in memory simultaneously).

                  • causal

                    today at 4:49 PM

                    Yeah title comes from nowhere in the link. No doubt it's possible but all that matters is speed and we learn nothing of that here...

                • baq

                  today at 5:09 PM

                  Intel Optane rolling in its grave.

                    • aitchnyu

                      today at 6:30 PM

                      Memristors are also missing in this AI hype even when they were around the corner 10 years back.

                      • liuliu

                        today at 5:13 PM

                        Still have 4 brand new ones in my storage unit. Just in case of moments like these.

                        Joke aside (I do have them, though!), I don't think Optane is that much use (not to mention it is only 256GiB for my unit). It is a useful legacy crutch if you have legacy software that is not designed to issue multiple reads / writes in parallel. If your software does issue them, Optane is really not faster than NVMe, especially these modern ones.

                          • zozbot234

                            today at 5:23 PM

                            It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert-layers immediately after routing), it's the wearout resistance which opens up the possibility of storing KV-cache (including the "linear" KV-cache of recent Qwen, which is not append-only as it was with the pure attention model) and maybe even per-layer activations - though this has the least use given how ephemeral these are.

                        • speedgoose

                          today at 5:29 PM

                          Is it too late for Intel to bring them back to life?

                            • c0balt

                              today at 5:34 PM

                              Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but it seems unlikely given the previous commercial failure.

                              • walterbell

                                today at 6:58 PM

                                Nvidia and SK Hynix are bringing HBF to market for $$.

                            • moffkalast

                              today at 5:36 PM

                              Wouldn't be Intel if they didn't quit halfway through on a good thing.

                              Still, couldn't one get a RAID 0 card with four drives to saturate a 16x lane? That's already the max one could push through PCIe anyhow.

                              • 0ptan3

                                today at 5:31 PM

                                pmem

                            • Insanity

                              today at 4:51 PM

                              This is a pretty cool project! Essentially this is like using Swap memory to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.

                              I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.

                                • zozbot234

                                  today at 4:57 PM

                                  This is not putting any stress or wear on the NVMe, it's a pure read workload.

                                    • tatef

                                      today at 6:29 PM

                                      Yes, exactly this.

                                  • embedding-shape

                                    today at 4:59 PM

                                    > but in a 'smart' way so you don't overload the NVMe unnecessarily

                                    "overloading NVMe"? What is that about? First time I've heard anything about it.

                                    > because putting a ton of stress on your NVMe during generation

                                    Really shouldn't "stress your NVMe", something is severely wrong if that's happening. I've been hammering my SSDs forever, and while write operations "hurt" the longevity of the flash cells themselves, the controller interface really shouldn't be affected by this at all, unless I'm missing something here.

                                      • hrmtst93837

                                        today at 7:32 PM

                                        People talk about "SSD endurance", but enough parallel I/O on M1/M2 can make the NVMe controller choke, with very weird latency spikes.

                                        • tatef

                                          today at 6:30 PM

                                          Hypura reads tensor weights from the GGUF file on NVMe into RAM/GPU memory pools, then compute happens entirely in RAM/GPU.

                                          There is no writing to SSDs on inference with this architecture.

                                            • embedding-shape

                                              today at 7:29 PM

                                              Even if there was a ton of writing, I'm not sure where NVMe even comes in the picture, write durability is about the flash cells on SSDs, nothing to do with the interface, someone correct me if I'm wrong.

                                          • Insanity

                                            today at 5:05 PM

                                            I had assumed heat generation on the controller if it's continuously reading. But maybe it's not actually bad.

                                              • throwway120385

                                                today at 5:45 PM

                                                Just pop a heatsink on it and call it good.

                                    • zozbot234

                                      today at 4:47 PM

                                      It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.

                                        • salynchnew

                                          today at 5:09 PM

                                          It was written by an LLM, so... yeah.

                                          • jeffybefffy519

                                            today at 5:30 PM

                                            Except this isn't using heavily quantised versions of the model, which would reduce quality.

                                        • root_axis

                                          today at 5:50 PM

                                          Are there any 1T parameter open source models?

                                            • zozbot234

                                              today at 5:52 PM

                                              Kimi 2.5?

                                                • ai-inquisitor

                                                  today at 6:22 PM

                                                  That model is "open weight", not open source. We have no idea what data Moonshot trained on.

                                                  • root_axis

                                                    today at 6:00 PM

                                                    Thanks, TIL.

                                            • nullbyte

                                              today at 5:18 PM

                                              I am curious how the TPS compares vs default OS virtual memory paging

                                              • speedgoose

                                                today at 5:34 PM

                                                I wonder how many minutes per token on GLM 5.

                                                • amelius

                                                  today at 5:32 PM

                                                  This is <1 tok/s for the 40GB model.

                                                  Come on, "Run" is not the right word. "Crawl" is.

                                                  Headlines like that are misleading.

                                                    • feznyng

                                                      today at 6:33 PM

                                                      Could still be useful; maybe for overnight async workloads? Tell your agent research xyz at night and wake up to a report.

                                                        • maleldil

                                                          today at 6:50 PM

                                                          Assuming 1 token per second and "overnight" being 12 hours, that's 43 200 tokens. I'm not sure what you can meaningfully achieve with that.

                                                      • smlacy

                                                        today at 5:59 PM

                                                        Yes, and with virtually zero context, which makes an enormous difference for TTFT on the MoE models.

                                                    • monksy

                                                      today at 5:04 PM

                                                      There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance. (My understanding is better GPU/CPU splits, etc). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.

                                                        • zozbot234

                                                          today at 5:16 PM

                                                          Ollama has very substandard support for mmap at present, which hurts inference with larger models. There are some recent pull requests in flight that should help address this to at least some extent https://github.com/ollama/ollama/pull/14525 https://github.com/ollama/ollama/pull/14134 https://github.com/ollama/ollama/pull/14864 but progress seems to be stalling. Their support for recent Qwen models seems to also have some bespoke incompatibilities with llama.cpp, which doesn't help matters; it's difficult to test the same model with both.

                                                          • rubiquity

                                                            today at 5:10 PM

                                                            llama.cpp and llama-swap do this better than Ollama and with far more control.

                                                              • circularfoyers

                                                                today at 6:51 PM

                                                                Don't even need to use llama-swap anymore now that llama-server supports the same functionality.

                                                        • EnPissant

                                                          today at 5:28 PM

                                                          You do not provide any comparison to llama.cpp with mmap.

                                                          You do not explain how any kind of predictor can work for MoE experts.

                                                          You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).

                                                          • anshulbasia27

                                                            today at 5:25 PM

                                                            OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the 4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per token.

                                                            What makes this approach faster is that the model's access pattern is completely deterministic during inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal. The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."

                                                            For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one, then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than expert 7. The neuron cache here is basically a domain-specific replacement policy.
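A minimal sketch of the layer-ahead prefetch idea described above (hypothetical flat file layout and fixed layer size; not Hypura's actual implementation):

```python
# Double-buffered layer streaming: issue a positional read of layer N+1
# on a background thread while layer N is being computed on.
import os
import threading

def run_layers(path: str, n_layers: int, layer_bytes: int, compute) -> None:
    """Overlap compute on the resident layer with the next layer's read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        current = os.pread(fd, layer_bytes, 0)  # load layer 0 up front
        for i in range(n_layers):
            box, t = {}, None
            if i + 1 < n_layers:
                # Prefetch the next layer while this one is in use.
                t = threading.Thread(
                    target=lambda j=i + 1: box.setdefault(
                        "buf", os.pread(fd, layer_bytes, j * layer_bytes)))
                t.start()
            compute(i, current)  # GPU/CPU work on the resident layer
            if t:
                t.join()
                current = box["buf"]
    finally:
        os.close(fd)
```

`os.pread` keeps the reader thread from racing the main thread over a shared file offset; a real scheduler would also want pinned buffers and more than one read in flight.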

                                                              • zozbot234

                                                                today at 5:26 PM

                                                                > The kernel's page fault handler is reactive — it doesn't know you're about to read layer 47's FFN weights, so it can't prefetch.

                                                                man 2 madvise
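For reference, the same hint is reachable from Python's mmap module; a sketch (MADV_WILLNEED asks the kernel to start paging the range in asynchronously):

```python
# Hint the kernel to prefetch an upcoming byte range of a weights mapping.
import mmap

def advise_next_layer(mm: mmap.mmap, offset: int, length: int) -> None:
    """Issue MADV_WILLNEED for the given range of the mapping."""
    page = mmap.PAGESIZE
    aligned = (offset // page) * page  # madvise needs page-aligned start
    mm.madvise(mmap.MADV_WILLNEED, aligned, length + (offset - aligned))
```

As noted downthread, this only helps if there is compute to overlap with the prefetch.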

                                                                • EnPissant

                                                                  today at 5:30 PM

                                                                  That assumes you have significant work to do between fetches (so you can prefetch while using the current data). With LLM decode you don't.

                                                                • Yanko_11

                                                                  today at 6:01 PM

                                                                  [dead]

                                                                  • anshulbasia27

                                                                    today at 5:24 PM

                                                                    [dead]

                                                                    • jee599

                                                                      today at 6:26 PM

                                                                      [dead]

                                                                      • tatef

                                                                        today at 4:04 PM

                                                                        [flagged]

                                                                          • password4321

                                                                            today at 4:50 PM

                                                                            Don't post generated/AI-edited comments. HN is for conversation between humans

                                                                            https://news.ycombinator.com/item?id=47340079

                                                                              • tatef

                                                                                today at 6:36 PM

                                                                                Noted, thanks. I had LLM help positioning this message but I did the initial draft along with edits. Will keep in mind for the future.

                                                                                • DennisP

                                                                                  today at 5:15 PM

                                                                                  That doesn't read like an AI-generated comment to me. He did mention he vibe-coded the project but that's not against the guidelines.

                                                                                    • Retr0id

                                                                                      today at 5:26 PM

                                                                                      It's either written by an LLM, or written by someone who learned to write by reading LLM output

                                                                                      • password4321

                                                                                        today at 5:27 PM

                                                                                        Vibe-coded project is fine.

                                                                                        At least prompt your LLM to dodge the obvious tells when commenting!

                                                                                        • Forgeties79

                                                                                          today at 5:23 PM

                                                                                          gptzero says 99% chance it’s AI-generated

                                                                                          It certainly has a lot of telltale signs

                                                                                          • Izikiel43

                                                                                            today at 5:17 PM

                                                                                            > The core insight:

                                                                                            That's a telltale sign of ai written text.

                                                                                    • causal

                                                                                      today at 4:50 PM

                                                                                      You need to change the title or actually include 1T parameter model content.

                                                                                      • frikk

                                                                                        today at 4:46 PM

                                                                                        This is interesting work, thank you for sharing. What hardware would you buy today for experimenting? Seems like the new gen of macbook pros are pretty powerful?

                                                                                          • tatef

                                                                                            today at 6:38 PM

                                                                                            Yes definitely. I use a M1 Max with 32gb of RAM daily and it's about on par from a performance standpoint with the new base M5 Pro 24gb. You can check the benchmarks in the repo if you're interested in seeing specific performance metrics, but investing in Apple hardware with as much memory as possible will generally get you furthest in this game.

                                                                                        • WithinReason

                                                                                          today at 4:55 PM

                                                                                          Have you ever generated access frequency statistics for the experts in these models, something like a histogram?

                                                                                        • lostmsu

                                                                                          today at 4:47 PM

                                                                                          Why would llama with --mmap crash?

                                                                                            • zozbot234

                                                                                              today at 4:58 PM

                                                                                              This doesn't surprise me all that much, mmap support gets little attention in general and interacts poorly with GPU-side inference. (And that's with it being default, you don't even really need to specify it as a CLI option.) OP has raised a discussion with the llama.cpp folks https://github.com/ggml-org/llama.cpp/discussions/20852 but little interest so far

                                                                                      • erikcw

                                                                                        today at 5:37 PM

                                                                                        Simon Willison wrote a good post about Dan Woods’ work on “Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally”.

                                                                                        [0] https://simonwillison.net/2026/Mar/18/llm-in-a-flash/