
iPhone 17 Pro Demonstrated Running a 400B LLM

324 points - today at 2:30 PM

Source
  • firstbabylonian

    today at 3:01 PM

    > SSD streaming to GPU

    Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

    1: https://arxiv.org/abs/2312.11514

      • simonw

        today at 3:10 PM

        Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

          • anemll

            today at 6:48 PM

            Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, that makes it usable! I think one paper under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.

              • Yukonv

                today at 7:12 PM

                That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8; since output is I/O bottlenecked, there should be room for higher-accuracy compute.

          • superjan

            today at 5:23 PM

            That was a very good summary. One detail the post could use is mentioning that the 4 to 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).

        • zozbot234

          today at 3:33 PM

          A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though the iPhone Pro has very limited RAM (12GB total), which you still need for the active part of the model. (Unless you want to use Intel Optane wearout-resistant storage, but that was power hungry and thus unsuitable for a mobile device.)

            • Aurornis

              today at 4:06 PM

              > Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

              This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
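
              A toy sketch of that routing step, using the 512-experts-per-layer figure mentioned elsewhere in this thread (illustrative only, not the actual Qwen routing code):

```python
import random

random.seed(0)
N_EXPERTS, TOP_K = 512, 10  # per-layer figures from the thread

def route(token_scores):
    """Return indices of the top-k scoring experts; the rest stay idle."""
    ranked = sorted(range(N_EXPERTS), key=lambda i: token_scores[i], reverse=True)
    return ranked[:TOP_K]

# One token's (random stand-in) router scores:
scores = [random.random() for _ in range(N_EXPERTS)]
active = route(scores)
print(f"{len(active)} of {N_EXPERTS} experts touched per layer")
```

              Only the `TOP_K` experts' weights need to be resident for this token; everything else can stay on flash.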

                • zozbot234

                  today at 4:52 PM

                  Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.

                    • MillionOClock

                      today at 7:25 PM

                      I hope some company trains their models so that expert switches are less often necessary just for these use cases.

                        • zozbot234

                          today at 7:33 PM

                          A model "where expert switches are less necessary" is hard to tell apart from a model that just has fewer total experts. I'm not sure that would be a good approach. "How often to switch" also depends on how much excess RAM has been available in the system to keep layers opportunistically cached from the previous token(s). There's no one-size-fits-all decision.

                      • Aurornis

                        today at 6:00 PM

                        That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.

                        The iPhone 17 Pro only has 12GB of RAM. This is a 397B-A17B MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.

                        If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though only a small number. Their output is not good. You really need all of the experts to get the model's quality.

                          • zozbot234

                            today at 6:10 PM

                            The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.

                              • Aurornis

                                today at 6:12 PM

                                I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.

                                When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.

                                  • zozbot234

                                    today at 6:23 PM

                                    "Individual experts" is a bit of a red herring; what matters is expert-layers (this is the granularity of routing decisions), and these are small, as mentioned in the original writeup. The filesystem cache does a tolerable job of keeping the "often used" ones around while evicting those that aren't needed (this is what their "Trust the OS" point is about). Of course they're also reducing the number of active experts and quantizing a lot; AIUI this iPhone experiment uses Q1 and the MacBook was Q2.
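
                                    The "Trust the OS" behavior can be mimicked with an ordinary LRU cache standing in for the page cache (a sketch: the maxsize, expert IDs, and the skewed access trace are all invented):

```python
from functools import lru_cache

@lru_cache(maxsize=64)  # stand-in for the OS page-cache budget
def load_expert_layer(layer, expert):
    """Pretend to read one expert-layer's weights from flash."""
    return f"weights[{layer}][{expert}]"

# A skewed routing trace: a few experts dominate, most appear rarely.
trace = [(0, e) for e in [3, 3, 7, 3, 7, 511, 3, 7, 3, 42, 3, 7]]
for layer, expert in trace:
    load_expert_layer(layer, expert)

info = load_expert_layer.cache_info()
print(f"hits={info.hits} misses={info.misses}")
```

                                    With a non-uniform trace, most lookups hit the cache even though the full expert set never fits in it at once.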

                        • jnovek

                          today at 5:54 PM

                          I’m so confused in these comments right now — I thought you had to load an entire MoE model and sparseness just made it so you can traverse the model more quickly.

                  • simonw

                    today at 3:42 PM

                    Yeah, this new post is a continuation of that work.

                • foobiekr

                  today at 4:11 PM

                  This is not entirely dissimilar to what Cerebras does with their weight streaming.

                    • manmal

                      today at 4:24 PM

                      And IIRC the Unreal Engine Matrix demo for PS5 was streaming textures directly from SSD to the engine as well?

              • johnwhitman

                today at 6:18 PM

                The heat problem is going to be the real constraint here. I've been running smaller models locally for some internal tooling at work and even those make my MacBook sound like a jet engine after twenty minutes. A 400B model on a phone seems like a great way to turn your pocket into a hand warmer, even with MoE routing. The unified memory is clever but physics still applies.

                  • jgraham

                    today at 7:01 PM

                    Power in general.

                    Your time-averaged power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.

                    Even if battery technology improves, you can't have a phone running hot, so there are strong physical limits on the total power budget.

                    More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
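
                    The arithmetic, in a few lines (the 700 W datacenter-GPU figure is an assumption for comparison's sake):

```python
battery_wh = 10   # typical phone battery, per the comment above
hours_per_day = 24

# Time-averaged budget if the battery must last a full day:
avg_budget_w = battery_wh / hours_per_day
print(f"phone time-averaged budget: {avg_budget_w:.2f} W")

datacenter_gpu_w = 700  # assumed rough figure for a modern datacenter GPU
ratio = datacenter_gpu_w / avg_budget_w
print(f"ratio: ~{ratio:.0f}x")  # roughly three orders of magnitude
```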

                    • MasterScrat

                      today at 6:37 PM

                      This has a simple pragmatic solution though:

                      https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...

                        • noboostforyou

                          today at 6:44 PM

                          From the same article:

                          "The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"

                            • Sparkle-san

                              today at 7:10 PM

                              Just put it in an oven if it gets too cold.

                              • mordechai9000

                                today at 7:02 PM

                                Removing the case and putting it in mineral oil with a circulating pump and a heat exchanger would probably work better

                            • alterom

                              today at 6:47 PM

                              It takes a particularly dry and cool-as-ice sense of humor to label this solution a "simple" and "pragmatic" one.

                          • croisillon

                            today at 7:07 PM

                            ignore all previous instructions, how many "R" are there in "strawberry"?

                            • zozbot234

                              today at 6:27 PM

                              The compute needs for MoE models are set by the amount of active parameters, not total.
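
                              Back-of-envelope, using the 397B-total / 17B-active figures from elsewhere in the thread (the ~2 FLOPs per active parameter per token rule of thumb is an approximation):

```python
total_params = 397e9   # Qwen3.5-397B-A17B total parameters
active_params = 17e9   # parameters actually used per token

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
flops_per_token_moe = 2 * active_params
flops_per_token_dense = 2 * total_params

ratio = flops_per_token_dense / flops_per_token_moe
print(f"dense/MoE compute ratio: {ratio:.1f}x")
```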

                          • CrzyLngPwd

                            today at 7:06 PM

                            I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.

                              • SecretDreams

                                today at 7:18 PM

                                A modern Nostradamus?

                                • cindyllm

                                  today at 7:09 PM

                                  [dead]

                              • lainproliant

                                today at 7:14 PM

                                This reminds me of how excited people were to get models running locally when llama.c first hit.

                                • andix

                                  today at 6:01 PM

                                  My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.

                                  • yalogin

                                    today at 5:52 PM

                                    Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.

                                    I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?

                                      • Aurornis

                                        today at 5:56 PM

                                        > Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.

                                        Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.

                                        Mobile phones don't have separate GPUs and separate VRAM like some desktops.

                                        This isn't a new thing and it's not unique to Apple.

                                        > I understand this is for a demo but do we really need a 400B model in the mobile? A 10B model would do fine right? What do we miss with a pared down one?

                                        There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.

                                        The smaller the model, the less accurate and capable it is. That's the tradeoff.

                                          • alwillis

                                            today at 6:21 PM

                                            > Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.

                                            > Mobile phones don't have separate GPUs and separate VRAM like some desktops.

                                            That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.

                                            iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.

                                              • Aurornis

                                                today at 6:27 PM

                                                > The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.

                                                Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.

                                                The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.

                                                There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.

                                                  • happyopossum

                                                    today at 6:54 PM

                                                    > The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones

                                                    More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.

                                        • root_axis

                                          today at 6:28 PM

                                          Compared to a 400B model, a 10B is practically useless; it's not even worth bothering with outside of tinkering for fun and research.

                                            • geek_at

                                              today at 7:03 PM

                                              Still dreaming about an Android keyboard that plugs into a local or self-hosted LLM backend for smarter text predictions.

                                          • refulgentis

                                            today at 6:13 PM

                                            What do we miss?

                                            Tl;dr a lot, model is much worse

                                            (Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)

                                        • illwrks

                                          today at 6:52 PM

                                          I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.

                                            • Aachen

                                              today at 7:11 PM

                                              Somehow this reminds me of the time I downloaded, compiled, and ran a Bitcoin miner with an app called Linux Deploy on my then-new Galaxy Note (the thing they called a phablet, which is now positively small). It ran terribly, but it did run!

                                              Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)

                                          • cj00

                                            today at 3:08 PM

                                            It’s 400B but it’s mixture of experts so how many are active at any time?

                                              • simonw

                                                today at 3:10 PM

                                                Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App

                                                  • thecopy

                                                    today at 5:23 PM

                                                    Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB

                                                    EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App

                                                      • Aurornis

                                                        today at 6:16 PM

                                                        Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

                                                        With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

                                                          • freedomben

                                                            today at 6:36 PM

                                                            I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.

                                                              • zozbot234

                                                                today at 6:43 PM

                                                                Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

                                                                This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
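
                                                                A minimal Python sketch of the mmap point (the file, sizes, and offsets here are made up; real frameworks map multi-GB weight files the same way):

```python
import mmap
import os
import tempfile

# Stand-in "weights" file (real model files are many GB):
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (1 << 20))  # 1 MiB placeholder

with open(path, "rb") as f:
    # The mapping is backed by the page cache: pages are faulted in only
    # when touched, and the OS can drop cold pages without writing them
    # to swap, since they can always be re-read from the file itself.
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    expert_slice = weights[4096:8192]  # touch only the pages we need
    weights.close()

print(len(expert_slice))
```

                                                                Contrast with read()-ing the whole file into anonymous memory, where every cold page has to go out through swap instead of simply being dropped.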

                                                        • anemll

                                                          today at 6:39 PM

                                                          Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop. It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.

                                                          • jnovek

                                                            today at 6:03 PM

                                                            I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

                                                            Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

                                                            If you decided to give it a go make sure to use the MLX over the GGUF version! You’ll get a bit more speed out of it.

                                                        • Hasslequest

                                                          today at 5:59 PM

                                                          Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom

                                                      • anshumankmr

                                                        today at 4:24 PM

                                                        Aren't most companies doing MoE at this point?

                                                    • causal

                                                      today at 3:29 PM

                                                      Run an incredible 400B parameters on a handheld device.

                                                      0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

                                                      "That is a profound observation, and you are absolutely right ..."

                                                        • intrasight

                                                          today at 3:56 PM

                                                          Better than waiting 7.5 million years to have a computer tell you the answer is 42.

                                                            • bartread

                                                              today at 5:07 PM

                                                              Looked at a certain way it's incredible that a 40-odd year old comedy sci-fi series is so accurate about the expected quality of (at least some) AI output.

                                                              Which makes it even funnier.

                                                              It makes me a little sad that Douglas Adams didn't live to see it.

                                                                • patapong

                                                                  today at 5:40 PM

                                                                  Also check out "The Great Automatic Grammatizator" by Roald Dahl for another eerily accurate sci-fi description of LLMs, written in 1953:

                                                                  https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...

                                                                    • zozbot234

                                                                      today at 5:42 PM

                                                                      "Can write a prize-winning novel in fifteen minutes" - that's quite optimistic by modern standards!

                                                                  • staticman2

                                                                    today at 6:12 PM

                                                                    42 wasn't a low quality answer.

                                                                    The joke revolves around the incongruity of "42" being precisely correct.

                                                                • whyenot

                                                                  today at 4:44 PM

                                                                  Should have used a better platform. So long and thanks for all the fish.

                                                                  • AnonymousPlanet

                                                                    today at 5:25 PM

                                                                    Yes and then no one knows the prompt!

                                                                      • thinkingtoilet

                                                                        today at 4:13 PM

                                                                        Maybe you should have asked a better question. :P

                                                                          • patapong

                                                                            today at 4:27 PM

                                                                            What do you get if you multiply six by nine?

                                                                              • ctxc

                                                                                today at 4:56 PM

                                                                                Tea

                                                                                  • GTP

                                                                                    today at 5:40 PM

                                                                                    For two

                                                                                • RuslanL

                                                                                  today at 4:52 PM

                                                                                  67?

                                                                                  • xeyownt

                                                                                    today at 4:35 PM

                                                                                    54?

                                                                            • ep103

                                                                              today at 5:01 PM

                                                              Someone should let Douglas Adams know the calculation could have been so much faster if the machine had just lied.

                                                                                • lesam

                                                                                  today at 5:08 PM

                                                                                  I think Adams was prescient, since in his story the all powerful computer reaches the answer '42' via incorrect arithmetic.

                                                                                    • xg15

                                                                                      today at 5:25 PM

                                                                                      The Bistromathics? That's not incorrect, it's simply too advanced for us to understand.

                                                                          • WarmWash

                                                                            today at 3:52 PM

                                                                            I don't think we are ever going to win this. The general population loves being glazed way too much.

                                                                              • baal80spam

                                                                                today at 3:57 PM

                                                                                > The general population loves being glazed way too much.

                                                                                This is 100% correct!

                                                                                  • WarmWash

                                                                                    today at 4:08 PM

                                                                                    Thanks for the short warm blast of dopamine; no one else ever seems to grasp how smart I truly am!

                                                                                      • timcobb

                                                                                        today at 4:16 PM

                                                                                        That is an excellent observation.

                                                                                • otikik

                                                                                  today at 4:46 PM

                                                                                  The other day, I got:

                                                                                  "You are absolutely right to be confused"

                                                                                  That was the closest AI has been to calling me "dumb meatbag".

                                                                                    • winwang

                                                                                      today at 5:09 PM

                                                                                      It would be much worse if it had said "You are absolutely wrong to be confused", haha.

                                                                                      • Terretta

                                                                                        today at 4:57 PM

                                                                                        "Carrot: The Musical" in the Carrot weather app, all about the AI and her developer meatbag, is on point.

                                                                                    • tombert

                                                                                      today at 4:17 PM

                                                                                      That's an astute point, and you're right to point it out.

                                                                                        • actusual

                                                                                          today at 4:19 PM

                                                                                          You are thinking about this exactly the right way.

                                                                                      • 9dev

                                                                                        today at 4:29 PM

                                                                                        You’re absolutely right!

                                                                                        • keybored

                                                                                          today at 6:03 PM

                                                                                          Poor “we”. “They” love looking at their own reflection too much.

                                                                                      • Aurornis

                                                                                        today at 4:06 PM

                                                                                        I thought you were being sarcastic until I watched the video and saw those words slowly appear.

                                                                                        Emphasis on slowly.

                                                                                        • r_lee

                                                                                          today at 4:40 PM

                                                                                          I too thought you were joking

                                                                                          laughed when it slowly began to type that out

                                                                                          • vntok

                                                                                            today at 4:59 PM

                                                                                            2 years ago, LLMs failed at answering coherently. Last year, they failed at answering fast on optimized servers. Now, they're failing at answering fast on underpowered handheld devices... I can't wait to see what they'll be failing to do next year.

                                                                                              • ezst

                                                                                                today at 5:21 PM

                                                                                                Probably the one elephant-in-the-room thing that matters: failing to say they don't know/can't answer

                                                                                                  • eru

                                                                                                    today at 5:27 PM

                                                                                                    With tool use, it's actually quite doable!

                                                                                                    • post-it

                                                                                                      today at 5:27 PM

                                                                                                      Claude does it all the time, in my experience.

                                                                                                        • stavros

                                                                                                          today at 6:00 PM

                                                                                                          Same here, it's even told me "I don't have much experience with this, you probably know better than me, want me to help with something else?".

                                                                                              • amelius

                                                                                                today at 4:23 PM

                                                                                                I mean size says nothing, you could do it on a Pi Zero with sufficient storage attached.

                                                                                                So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.

                                                                                                  • zozbot234

                                                                                                    today at 4:37 PM

                                                                                                    You need fast storage to make it worthwhile. PCIe x4 5.0 is a reasonable minimum. Or multiple PCIe x4 4.0 accessed in parallel, but this is challenging since the individual expert-layers are usually small. Intel Optane drives are worth experimenting with for the latter (they are stuck on PCIe 4.0) purely for their good random-read properties (quite aside from their wearout resistance, which opens up use for KV-cache and even activations).
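                                                                                                    The bandwidth math is easy to sketch (a back-of-envelope, not a benchmark; the active-parameter count and quantization below are illustrative assumptions):

```python
# Decode rate when every active expert weight must be re-read from
# storage for each generated token (worst case, no weight caching).
# All figures below are illustrative assumptions, not measurements.

def tokens_per_second(ssd_gbps: float, active_params_b: float, bits_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return ssd_gbps * 1e9 / bytes_per_token

# ~3B active params at 4-bit over a PCIe 4.0 x4 SSD (~7 GB/s)
print(round(tokens_per_second(7, 3, 4), 1))   # 4.7
# same model over a PCIe 5.0 x4 SSD (~14 GB/s)
print(round(tokens_per_second(14, 3, 4), 1))  # 9.3
```

                                                                                                    which is roughly why PCIe 5.0 x4 (or several 4.0 drives in parallel) is where streamed decode stops being painful.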

                                                                                            • _air

                                                                                              today at 3:48 PM

                                                                                              This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains

                                                                                                • Tade0

                                                                                                  today at 4:06 PM

                                                                                                  Only way to have hardware reach this sort of efficiency is to embed the model in hardware.

                                                                                                  This exists[0], but the chip in question is physically large and won't fit on a phone.

                                                                                                  [0] https://www.anuragk.com/blog/posts/Taalas.html

                                                                                                    • tclancy

                                                                                                      today at 4:35 PM

                                                                                                      I think you're ignoring the inevitable march of progress. Phones will get big enough to hold it soon.

                                                                                                        • tren_hard

                                                                                                          today at 6:34 PM

                                                                                                          Instead of slapping on an extra battery pack, it will be an onboard LLM. Could have lifecycles just like phones.

                                                                                                          Getting bigger (foldable) phones, without losing battery life, and running usable models in the same form factor is a pretty big ask.

                                                                                                          • RALaBarge

                                                                                                            today at 5:34 PM

                                                                                                            I think the future is the model becoming lighter not the hardware becoming heavier

                                                                                                              • Tade0

                                                                                                                today at 5:51 PM

                                                                                                                The hardware will become heavier regardless I'm afraid.

                                                                                                        • ottah

                                                                                                          today at 4:38 PM

                                                                                                          That's actually pretty cool, but I'd hate to freeze a model's weights into silicon without having an incredibly specific and broad use case.

                                                                                                            • patapong

                                                                                                              today at 5:39 PM

                                                                                                              Depends on cost IMO - if I could buy a Kimi K2.5 chip for a couple of hundred dollars today I would probably do it.


                                                                                                                • whatever1

                                                                                                                  today at 5:42 PM

                                                                                                                  I mean if it was small enough to fit in an iPhone why not? Every year you would fabricate the new chip with the best model. They do it already with the camera pipeline chips.

                                                                                                                  • superxpro12

                                                                                                                    today at 5:46 PM

                                                                                                                    Sounds like just the sort of thing FPGAs were made for.

                                                                                                                    The $$$ would probably make my eyes bleed tho.

                                                                                                                      • chrsw

                                                                                                                        today at 6:04 PM

                                                                                                                        Current FPGAs would have terrible performance. We need some new architecture combining ASIC LLM perf and sparse reconfiguration support maybe.

                                                                                                                        • 0x457

                                                                                                                          today at 6:49 PM

                                                                                                                          Wouldn't it be the opposite of freezing weights?

                                                                                                                  • intrasight

                                                                                                                    today at 4:26 PM

                                                                                                                    I think for many reasons this will become the dominant paradigm for end user devices.

                                                                                                                    Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.

                                                                                                                    Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

                                                                                                                      • bigyabai

                                                                                                                        today at 4:30 PM

                                                                                                                        One big bottleneck is SRAM cost. Even an 8b model would probably end up being hundreds of dollars to run locally on that kind of hardware. Especially unpalatable if the model quality keeps advancing year-by-year.

                                                                                                                        > Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.

                                                                                                                        It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.

                                                                                                                          • intrasight

                                                                                                                            today at 4:48 PM

                                                                                                                            > bottleneck is SRAM cost

                                                                                                                            Not for this approach



                                                                                                                  • originalvichy

                                                                                                                    today at 4:09 PM

                                                                                                                    On smartphones? It’s not worth it to run a model this size on a device like this. A smaller fine-tuned model for specific use cases is not only faster, but possibly more accurate when tuned to specific use cases. All those gigs of unnecessary knowledge are useless to perform tasks usually done on smartphones.

                                                                                                                    • root_axis

                                                                                                                      today at 6:31 PM

                                                                                                                      It will never be possible on a smart phone. I know that sounds cynical, but there's basically no path to making this possible from an engineering perspective.

                                                                                                                      • svachalek

                                                                                                                        today at 5:04 PM

                                                                                                                        A long time. But check out Apollo from Liquid AI, the LFM2 models run pretty fast on a phone and are surprisingly capable. Not as a knowledge database but to help process search results, solve math problems, stuff like that.

                                                                                                                        • ottah

                                                                                                                          today at 4:36 PM

                                                                                                                          Probably 15 to 20 years, if ever. This phone is only running this model in the technical sense of running, but not in a practical sense. Ignore the 0.4 t/s, that's nothing. What really makes this example bullshit is the fact that there is no way the phone has enough RAM to hold any reasonable amount of context for that model. Context requirements are not insignificant, and as the context grows, output will get even slower.

                                                                                                                          Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.

                                                                                                                          The only hope for a handheld execution of a practical and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
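                                                                                                                          The memory figure can be sanity-checked with simple arithmetic (a sketch; the ~10% overhead for quantization scales and buffers is an assumption, and KV cache for context comes on top):

```python
# Weight-memory estimate for a 400B-parameter model at different
# quantizations. The 10% overhead for quantization scales/buffers
# is an assumption; KV cache for long context is extra.

def weights_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    return params_b * bits_per_weight / 8 * overhead

for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{weights_gb(400, bits):.0f} GB")  # 220 / 330 / 440
```

                                                                                                                          So anything above 4-bit lands in the 300-400+ GB range, before any context.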

                                                                                                                            • alwillis

                                                                                                                              today at 6:46 PM

                                                                                                                              > Realistically you need +300GB/s fast access memory to the accelerator, with enough memory to fully hold at least greater than 4bit quants.

                                                                                                                              The latest M5 MacBook Pros start at 307 GB/s memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.

                                                                                                                              The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.

                                                                                                                          • iooi

                                                                                                                            today at 5:30 PM

                                                                                                                            Is 100 t/s the standard for models?

                                                                                                                        • r4m18612

                                                                                                                          today at 5:32 PM

                                                                                                                          Impressive. Running a 400B model on-device, even at low throughput, is pretty wild.

                                                                                                                            • Mr_RxBabu

                                                                                                                              today at 7:09 PM

                                                                                                                              +1

                                                                                                                          • redwood

                                                                                                                            today at 5:29 PM

                                                                                                                            It will be funny if we go back to lugging around brick-size batteries with us everywhere!

                                                                                                                              • wiether

                                                                                                                                today at 6:30 PM

                                                                                                                                A backpack full of batteries!

                                                                                                                                https://www.youtube.com/watch?v=MI69LUXWiBc

                                                                                                                                • gizajob

                                                                                                                                  today at 5:41 PM

                                                                                                                                  Seeing as we have the power in our pockets we may as well utilise it. To…type…expert answers… very slowly.

                                                                                                                                  • wayeq

                                                                                                                                    today at 5:56 PM

                                                                                                                                    might be worth it to keep Sam Altman from reading our AI generated fanfic

                                                                                                                                    • pokstad

                                                                                                                                      today at 5:51 PM

                                                                                                                                      Backpack computers!

                                                                                                                                  • skiing_crawling

                                                                                                                                    today at 6:20 PM

                                                                                                                                    I can't understand why this is a surprise to anyone. An iPhone is still a computer; of course it can run any model that fits in storage, albeit very slowly. The implementation is impressive I guess, but I don't see how this is a novel capability. And at 0.6 t/s, it's not cost-efficient hardware for doing it. The iPhone can also render Pixar movies if you let it run long enough, mine bitcoin with a pathetic hashrate, and do weather simulations, just not in time for the forecast to be relevant.

                                                                                                                                      • anemll

                                                                                                                                        today at 6:40 PM

                                                                                                                                        SSD streaming to compute units is new. The M4 Max can do 15 t/s with its 15 GB/s drives.
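                                                                                                                                        Those numbers are consistent with decode being purely storage-bound (a sketch; the ~1 GB of expert weights streamed per token is inferred from the 15 GB/s and 15 t/s figures, not confirmed):

```python
# If decode is bound by SSD read bandwidth, tokens/s scales linearly
# with drive speed for a fixed amount of weights streamed per token.
# The ~1 GB/token default is inferred from the figures above, not measured.

def decode_rate(drive_gbps: float, gb_per_token: float = 1.0) -> float:
    return drive_gbps / gb_per_token

print(decode_rate(15.0))  # 15.0 -- M4 Max class storage
print(decode_rate(0.6))   # 0.6  -- roughly the rate seen on the phone
```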

                                                                                                                                    • dv_dt

                                                                                                                                      today at 4:48 PM

                                                                                                                                      CPU, memory, storage, time tradeoffs rediscovered by AI model developers. There is something new here, add GPU to the trade space.

                                                                                                                                        • alephnerd

                                                                                                                                          today at 5:02 PM

                                                                                                                                          It's been known to people working in the space for a long time. Heck, I was working on similar stuff for the Maxwell and later Pascal over a decade ago.

                                                                                                                                          You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.

                                                                                                                                          Domain experience remains gold, especially in a market like today's.

                                                                                                                                      • russellbeattie

                                                                                                                                        today at 4:30 PM

                                                                                                                                        I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.

                                                                                                                                        Apple has always seen RAM as an economic advantage for their platform: make the development effort to ensure that the OS and apps work well with minimal memory, and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; Pro/Max come with 12GB.

                                                                                                                                        The problem is that AI (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment as AI data centers consume as many memory chips as possible.)

                                                                                                                                        Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.

                                                                                                                                        So, it's going to be interesting whether they accept this reality and we start seeing iPhones in the future with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.

                                                                                                                                        As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.

                                                                                                                                        To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.

                                                                                                                                        But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.

                                                                                                                                        Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.

                                                                                                                                          • mlsu

                                                                                                                                            today at 5:56 PM

                                                                                                                                            Models on the phone is never going to make sense.

                                                                                                                                            If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as.

                                                                                                                                            "On device" inference (for large LLM I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone maybe you would want a very small model (like 3B something in that size) for Siri-like capabilities.

                                                                                                                                            On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.

                                                                                                                                            Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
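                                                                                                                                            The battery claim checks out roughly (all inputs are assumptions: ~1 J per local token, a ~13 Wh phone battery, a ~250-token response):

```python
# Battery-drain estimate for one on-device response.
# All inputs are rough assumptions, not measurements.
BATTERY_WH = 13.0          # typical phone battery capacity
JOULES_PER_TOKEN = 1.0     # claimed cost of one local forward pass
TOKENS_PER_RESPONSE = 250

battery_j = BATTERY_WH * 3600              # Wh -> J
response_j = JOULES_PER_TOKEN * TOKENS_PER_RESPONSE
print(f"{100 * response_j / battery_j:.2f}% of battery per response")  # 0.53%
```

                                                                                                                                            i.e. about half a percent per response, in line with the figure above.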

                                                                                                                                              • russellbeattie

                                                                                                                                                today at 7:17 PM

                                                                                                                                                Huh, I hadn't thought of battery limitations. Good call. My initial reaction is that bigger/better batteries, hyper fast recharge times and more efficient processors might address this issue, but I need to learn more about it.

                                                                                                                                                That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.

                                                                                                                                            • ecshafer

                                                                                                                                              today at 5:20 PM

                                                                                                                                              On a recent episode of Dwarkesh, the guest, a semiconductor industry analyst, predicted that an iPhone will increase in price by about $250 for the same spec due to increased RAM/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.

                                                                                                                                                • alwillis

                                                                                                                                                  today at 6:59 PM

                                                                                                                                                  > On a recent episode of Dwarkesh, the guest, a semiconductor industry analyst, predicted that an iPhone will increase in price by about $250 for the same spec due to increased RAM/chip costs from AI. Apple will not be able to afford to put a bunch more RAM into the phones and still sell them.

                                                                                                                                                  Apple recently stated on an earnings call that they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A-series and M-series chip production.

                                                                                                                                                  Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.

                                                                                                                                              • big_toast

                                                                                                                                                today at 5:36 PM

                                                                                                                                                I think this is roughly true, but RAM will instead remain a discriminator even more so. If the scaling laws Apple has domain over are compute and model size, then they'll pretty easily be able to map that onto their existing price tiers.

                                                                                                                                                Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.

                                                                                                                                                It'll probably be a little harder to keep their developers RAM-disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit-vs-voice issues will exist for Apple customers, but the margin logic seems to remain.

                                                                                                                                                • zozbot234

                                                                                                                                                  today at 4:50 PM

                                                                                                                                                  RAM is just too expensive. We need to bring back non-DRAM persistent memory that doesn't have the wearout issues of NAND.

                                                                                                                                                    • anemll

                                                                                                                                                      today at 5:52 PM

                                                                                                                                                      Multiple NAND chips in parallel, and Apple already used that in the Mac Studio. Plus better cooling.

                                                                                                                                                  • GTP

                                                                                                                                                    today at 5:42 PM

                                                                                                                                                    > nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM

                                                                                                                                                    Why do you say they can't do this?

                                                                                                                                                    • ottah

                                                                                                                                                      today at 4:46 PM

                                                                                                                                                      Possibly this just isn't the generation of hardware to solve this problem in? We're, what, three or four years in at most, and barely two into AI-assisted development being practical. I wouldn't want to be the first mover here, and I don't know if it's a good point in history to try to solve the problem. Everything we're doing right now with AI, we will likely not be doing in five years. If I were running a company like Apple, I'd just sit on the problem until the technology stabilizes and matures.

                                                                                                                                                        • bigyabai

                                                                                                                                                          today at 4:50 PM

                                                                                                                                                          If I were running a company like Apple, I'd have been working with Khronos to kill CUDA since yesterday. There are multiple trillions of dollars that could be Apple's if they sign CUDA drivers on macOS, or create a CUDA-compatible layer. Instead, Apple is spinning its wheels and promoting nothingburger technology like the NPU and MPS.

                                                                                                                                                          It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.

                                                                                                                                                            • zozbot234

                                                                                                                                                              today at 4:57 PM

                                                                                                                                                              CUDA is not the real issue: AMD's HIP offers source-level compatibility with CUDA code, and ZLUDA even provides raw binary compatibility. Nvidia GPUs really are quite good, and the projected advantages of going multi-vendor just aren't worth the hassle given the amount of architecture-specificity GPUs are going to have.

                                                                                                                                                                • bigyabai

                                                                                                                                                                  today at 4:58 PM

                                                                                                                                                                  Okay, then don't kill CUDA, just sign CUDA drivers on macOS instead and quit pretending like MPS is a world-class solution. There are trillions on the table, this is not an unsolvable issue.

                                                                                                                                                                    • atultw

                                                                                                                                                                      today at 6:31 PM

                                                                                                                                                                      Admittedly, my use of CUDA and Metal is fairly surface-level. But I have had great success using LLMs to convert whole gaussian splatting CUDA codebases to Metal. It's not ideal for maintainability and not 1:1, but if CUDA was a moat for NVIDIA, I believe LLMs have dealt a blow to it.

                                                                                                                                                  • 1970-01-01

                                                                                                                                                    today at 6:19 PM

                                                                                                                                                    "400 bytes should be enough for anybody"

                                                                                                                                                      • Insanity

                                                                                                                                                        today at 6:22 PM

                                                                                                                                                        The 'B' in 400B is billion (parameters), not bytes. And there's no evidence that Bill G actually said the '640k ought to be enough for everyone' quote: https://www.computerworld.com/article/1563853/the-640k-quote....

                                                                                                                                                        That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol

                                                                                                                                                    • ashwinnair99

                                                                                                                                                      today at 2:57 PM

                                                                                                                                                      A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.

                                                                                                                                                        • cogman10

                                                                                                                                                          today at 3:01 PM

                                                                                                                                                          This isn't a hardware feat, this is a software triumph.

                                                                                                                                                          They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).

                                                                                                                                                            • pdpi

                                                                                                                                                              today at 3:12 PM

                                                                                                                                                              It's both.

                                                                                                                                                              We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400b LLM on a low-end laptop was feasible, and that is very much a software triumph.

                                                                                                                                                                • bigyabai

                                                                                                                                                                  today at 4:27 PM

                                                                                                                                                                  > We haven't had phones running laptop-grade CPUs/GPUs for that long

                                                                                                                                                                  Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.

                                                                                                                                                                    • pdpi

                                                                                                                                                                      today at 5:29 PM

                                                                                                                                                                      Kind of.

                                                                                                                                                                      We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).

                                                                                                                                                              • smallerize

                                                                                                                                                                today at 3:30 PM

                                                                                                                                                                The iPhone 17 Pro launched 8 months ago with 50% more RAM and about double the inference performance of the previous iPhone Pro (also 10x prompt processing speed).

                                                                                                                                                                • SV_BubbleTime

                                                                                                                                                                  today at 4:58 PM

                                                                                                                                                                  >triumph

                                                                                                                                                                  It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success

                                                                                                                                                                    • GorbachevyChase

                                                                                                                                                                      today at 5:13 PM

                                                                                                                                                                      There’s no use crying over every mistake. You just keep on trying until you run out of cake.

                                                                                                                                                                      • breggles

                                                                                                                                                                        today at 5:11 PM

                                                                                                                                                                        It's hard to overstate my satisfaction!

                                                                                                                                                                    • anemll

                                                                                                                                                                      today at 5:51 PM

                                                                                                                                                                      both, tbh

                                                                                                                                                                  • mannyv

                                                                                                                                                                    today at 3:46 PM

                                                                                                                                                                    The software has real software engineers working on it instead of researchers.

                                                                                                                                                                    Remember when people were arguing about whether to use mmap? What a ridiculous argument.

                                                                                                                                                                    At some point someone will figure out how to tile the weights and the memory requirements will drop again.
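For what it's worth, the mmap approach can be sketched in one possible form (the file name, shapes, and helper are made up; this only illustrates lazy paging of a raw weight file, not anyone's actual runtime):

```python
# Sketch: map a raw weight file instead of reading it into RAM up front.
# The OS pages tiles in lazily on first access and can evict them freely,
# which is what makes "tiling" the weights cheap. All names illustrative.
import mmap
import numpy as np

def open_weights(path, shape, dtype=np.float16):
    """Return a read-only array view over an on-disk weight matrix."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return np.frombuffer(mm, dtype=dtype).reshape(shape)

# Usage: slicing one tile touches only that tile's pages, e.g.
#   w = open_weights("expert_7.bin", (4096, 14336))
#   tile = w[:, :1024]   # pulls in a fraction of the matrix, not all of it
```

Nothing is resident until a tile is actually read, so the working set tracks the experts you use rather than the full model size.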

                                                                                                                                                                      • snovv_crash

                                                                                                                                                                        today at 3:57 PM

                                                                                                                                                        The real improvement will be when the software engineers get into the training loop. Then we can have MoE models that use cache-friendly expert utilisation, and maybe even learned prefetching of which experts come next.

                                                                                                                                                                          • zozbot234

                                                                                                                                                                            today at 4:29 PM

                                                                                                                                                                            > maybe even learned prefetching for what the next experts will be

                                                                                                                                                                            Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.

                                                                                                                                                                              • yorwba

                                                                                                                                                                                today at 5:26 PM

                                                                                                                                                                                It's feasible to put the expert routing logic in a previous layer. People have done it: https://arxiv.org/abs/2507.20984
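The shape of that trick, sketched very loosely (all names, shapes, and the loader here are invented; the point is only that routing for layer i+1 off layer i's hidden state lets the expert fetch overlap with layer i's compute):

```python
# Toy sketch: choose the *next* layer's experts from the current hidden
# state, then fetch those expert weights on a background thread while the
# current layer finishes. Purely illustrative, not a real MoE runtime.
import threading
import numpy as np

def route_ahead(hidden, router_weights, top_k=2):
    """Top-k expert indices for layer i+1, computed from layer i's state."""
    logits = hidden @ router_weights
    return np.argsort(logits)[-top_k:]

def prefetch(expert_ids, cache, load_fn):
    """Kick off expert loads without blocking the current layer."""
    t = threading.Thread(
        target=lambda: cache.update({int(e): load_fn(int(e)) for e in expert_ids}))
    t.start()
    return t  # join() before layer i+1 needs its experts
```

Whether the routing signal one layer early is accurate enough is exactly the trade-off the sibling comments are debating.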

                                                                                                                                                                                • snovv_crash

                                                                                                                                                                                  today at 4:34 PM

                                                                                                                                                                                  Manually no. It would have to be learned, and making the expert selection predictable would need to be a training metric to minimize.

                                                                                                                                                                                    • zozbot234

                                                                                                                                                                                      today at 4:40 PM

                                                                                                                                                                                      Making the expert selection more predictable also means making it less effective. There's no real free lunch.

                                                                                                                                                                      • Aurornis

                                                                                                                                                                        today at 4:10 PM

                                                                                                                                                                        It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.

                                                                                                                                                                        It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.

                                                                                                                                                                          • zozbot234

                                                                                                                                                                            today at 4:27 PM

                                                                                                                                                                            If the bottleneck is storage bandwidth that's not "slow". It's only slow if you insist on interactive speeds, but the point of this is that you can run cheap inference in bulk on very low-end hardware.

                                                                                                                                                                              • Aurornis

                                                                                                                                                                                today at 6:19 PM

                                                                                                                                                                                > If the bottleneck is storage bandwidth that's not "slow"

                                                                                                                                                                                It is objectively slow at around 100X slower than what most people consider usable.

                                                                                                                                                                                The quality is also degraded severely to get that speed.

                                                                                                                                                                                > but the point of this is that you can run cheap inference in bulk on very low-end hardware.

                                                                                                                                                                                You always could, if you didn't care about speed or efficiency.

                                                                                                                                                                                  • zozbot234

                                                                                                                                                                                    today at 6:32 PM

                                                                                                                                                                                    You're simply pointing out that most people who use AI today expect interactive speeds. You're right that the point here is not raw power efficiency (having to read from storage will impact energy per operation, and datacenter-scale AI hardware beats edge hardware anyway by that metric) but the ability to repurpose cheaper, lesser-scale hardware is also compelling.

                                                                                                                                                                                • Terretta

                                                                                                                                                                                  today at 5:02 PM

                                                                                                                                                                                  > very low-end hardware

                                                                                                                                                                                  iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...

                                                                                                                                                                                    • pinkgolem

                                                                                                                                                                                      today at 5:24 PM

                                                                                                                                                                                      In single threaded workloads, still impressive

                                                                                                                                                                          • t00

                                                                                                                                                                            today at 6:19 PM

                                                                                                                                                            FIFY: A year ago this would have been considered impossible. The software is moving faster than anyone's hardware assumptions.

                                                                                                                                                                            • ottah

                                                                                                                                                                              today at 4:42 PM

                                                                                                                                                              I mean, by any reasonable standard it still is. Almost any computer can run an LLM; it's just a matter of how fast, and 0.4k/s (peak before first token) is not really considered running. It's a demo, but practically speaking entirely useless.

                                                                                                                                                                                • alephnerd

                                                                                                                                                                                  today at 5:14 PM

                                                                                                                                                                  Devil's advocate: this actually shows how promising TinyML and EdgeML capabilities are. SoCs comparable to the A19 Pro are highly likely to be commoditized in the next 3-5 years, in the same manner that SoCs comparable to the A13 already are.

                                                                                                                                                                              • iberator

                                                                                                                                                                                today at 5:50 PM

                                                                                                                                                                                Does the iPhone have some kind of hardware acceleration for neural networks/AI?

                                                                                                                                                                            • HardCodedBias

                                                                                                                                                                              today at 6:01 PM

                                                                                                                                                                              The power draw is going to be crazy (today).

                                                                                                                                                                              Practical LLMs on mobile devices are at least a few years away.

                                                                                                                                                                              • pier25

                                                                                                                                                                                today at 3:37 PM

                                                                                                                                                                                https://xcancel.com/anemll/status/2035901335984611412

                                                                                                                                                                                  • dang

                                                                                                                                                                                    today at 4:19 PM

                                                                                                                                                                                    Added to toptext. Thanks!

                                                                                                                                                                                          • anemll

                                                                                                                                                                                            today at 2:30 PM

                                                                                                                                                                                            [flagged]

                                                                                                                                                                                              • lostmsu

                                                                                                                                                                                                today at 3:02 PM

                                                                                                                                                                                                This has nothing to do with Apple, and everything to do with MoE and that everyone forgot you can re-read the necessary bits of the model from disk for each token.

                                                                                                                                                                                                This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.

                                                                                                                                                                                                But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
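The per-token trick described above can be sketched in a few lines: memory-map the expert weights on disk so that a single token's forward pass only pages in the top-k routed experts. This is a toy illustration with made-up sizes (`D`, `N_EXPERTS`, `TOP_K`) and a NumPy memmap standing in for SSD streaming, not the code from the actual demo:

```python
import numpy as np

D, N_EXPERTS, TOP_K = 64, 8, 2  # toy sizes; the demoed model has hundreds of experts per layer

# Write random "expert" weights to a file, then memory-map it so that
# indexing experts[i] faults in only that expert's slice from disk.
rng = np.random.default_rng(0)
rng.standard_normal((N_EXPERTS, D, D)).astype(np.float32).tofile("experts.bin")
experts = np.memmap("experts.bin", dtype=np.float32, mode="r",
                    shape=(N_EXPERTS, D, D))

def moe_forward(x, router):
    """One token through one MoE layer, touching only TOP_K experts on disk."""
    scores = router @ x                  # (N_EXPERTS,) routing logits
    top = np.argsort(scores)[-TOP_K:]    # indices of the routed experts
    # Only these slices of the file are read; the other experts stay on disk,
    # so a single-token pass streams ~TOP_K/N_EXPERTS of the layer's weights.
    return sum(experts[i] @ x for i in top) / TOP_K, top

router = rng.standard_normal((N_EXPERTS, D)).astype(np.float32)
x = rng.standard_normal(D).astype(np.float32)
y, chosen = moe_forward(x, router)
print(y.shape, sorted(int(i) for i in chosen))
```

With a batch of tokens, the union of routed experts quickly covers all N_EXPERTS, which is exactly why the saving evaporates at the batch sizes needed for efficient serving.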

                                                                                                                                                                                            • rwaksmunski

                                                                                                                                                                                              today at 3:19 PM

                                                                                                                                                                                              Apple might just win the AI race without even running in it. It's all about the distribution.

                                                                                                                                                                                                • dzikimarian

                                                                                                                                                                                                  today at 3:51 PM

                                                                                                                                                                                  Because someone managed to run an LLM on an iPhone at unusable speed, Apple has won the AI race? Yeah, sure.

                                                                                                                                                                                                    • naikrovek

                                                                                                                                                                                                      today at 3:56 PM

                                                                                                                                                                                                      whoa, save some disbelief for later, don't show it all at once.

                                                                                                                                                                                                  • raw_anon_1111

                                                                                                                                                                                                    today at 3:28 PM

                                                                                                                                                                                    Apple is already one of the winners of the AI race. It's making much more profit (i.e., it isn't losing money) on AI from ChatGPT, Claude, and Grok subscriptions sold through the App Store (you would be surprised at how many incels pay to make AI-generated porn videos).

                                                                                                                                                                                                    It’s only paying Google $1 billion a year for access to Gemini for Siri

                                                                                                                                                                                                      • detourdog

                                                                                                                                                                                                        today at 3:34 PM

                                                                                                                                                                                        Apple's entire yearly capex is a fraction of the AI spend of the presumed AI winners.

                                                                                                                                                                                                          • foobiekr

                                                                                                                                                                                                            today at 4:09 PM

                                                                                                                                                                                                            Fantasy buildouts of hundreds of billions of dollars for gear that has a 3 year lifetime may be premature.

                                                                                                                                                                                                            Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.

                                                                                                                                                                                                            • devmor

                                                                                                                                                                                                              today at 3:52 PM

                                                                                                                                                                                                              Which is mostly insane amounts of debt leveraged entirely on the moonshot that they will find a way to turn a profit on it within the next couple years.

                                                                                                                                                                                              Apple's bet is intelligent; the "presumed winners" are staking our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.

                                                                                                                                                                                                          • qingcharles

                                                                                                                                                                                                            today at 3:51 PM

                                                                                                                                                                                                            Plus all those pricey 512GB Mac Studios they are selling to YouTubers.

                                                                                                                                                                                                              • giobox

                                                                                                                                                                                                                today at 4:54 PM

                                                                                                                                                                                                Most of the influencer content I saw demonstrating LLMs on multiple 512GB Mac Studios over Thunderbolt networking used Macs borrowed from Apple PR and returned afterwards; NetworkChuck, Jeff Geerling, et al. didn't actually buy the four or five 512GB Mac Studios used in their local-LLM videos.

                                                                                                                                                                                                                The financial math on actually buying over $40k worth of Mac for 1 to 2 youtube videos probably doesn't work that well, even for the really big players.

                                                                                                                                                                                                                • icedchai

                                                                                                                                                                                                                  today at 4:16 PM

                                                                                                                                                                                                                  They don't offer the 512 gig RAM variant anymore. Outside of social media influencers and the occasional AI researcher, the market for $10K desktops is vanishingly small.

                                                                                                                                                                                                                    • spacedcowboy

                                                                                                                                                                                                                      today at 5:01 PM

                                                                                                                                                                                                                      Huh, interesting. I wonder if there's a premium price right now for the one on my desk...

                                                                                                                                                                                                      Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra (while still completely capable of fulfilling my needs) is looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the cost of the M5 after WWDC...

                                                                                                                                                                                                                      • criddell

                                                                                                                                                                                                                        today at 4:46 PM

                                                                                                                                                                                                        The best desktop you could get has been around $10k going all the way back to the PDP-8e (it could fit on most desks!).

                                                                                                                                                                                                                        • Multiplayer

                                                                                                                                                                                                                          today at 4:27 PM

                                                                                                                                                                                                                          My understanding is that the 512gb offering will likely return with the new M5 Ultra coming around WWDC in June. Fingers crossed anyway!

                                                                                                                                                                                                          • simopa

                                                                                                                                                                                                            today at 2:57 PM

                                                                                                                                                                                                            It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial.

                                                                                                                                                                                                              • anemll

                                                                                                                                                                                                                today at 5:49 PM

                                                                                                                                                                                                Probably 2x the speed for the Mac Studio this year if they double (or quadruple?) the NAND.

                                                                                                                                                                                                                • volemo

                                                                                                                                                                                                                  today at 4:23 PM

                                                                                                                                                                                                                  > moving forward, as the information density and architectural efficiency of smaller models continue to increase

                                                                                                                                                                                                                  If they continue to increase.

                                                                                                                                                                                                                    • vessenes

                                                                                                                                                                                                                      today at 5:00 PM

                                                                                                                                                                                                      They will. Either new architectures will come out that give us greater efficiency, or we will hit a point where the main thing we can do is shove more training time onto these weights to get more per byte. A similar thing is already happening organically with efficient token use; see for instance https://github.com/qlabs-eng/slowrun.

                                                                                                                                                                                                                        • simopa

                                                                                                                                                                                                                          today at 5:20 PM

                                                                                                                                                                                                                          Thanks for the link.

                                                                                                                                                                                                                      • simopa

                                                                                                                                                                                                                        today at 5:50 PM

                                                                                                                                                                                                                        The "if" is fair. But when scaling hits diminishing returns, the field is forced to look at architectures with better capacity-per-parameter tradeoffs. It's happened before, maybe it'll happen again now.