
DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

238 points - today at 12:23 AM

Source
  • hnfong

    today at 4:29 AM

    As other commenters have mentioned, the performance of this setup is probably not great, since there isn't enough VRAM and lots of bits have to be moved between CPU and GPU RAM.

    That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic

    I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.

    Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but it still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this was recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...

    DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.

      • SlavikCA

        today at 6:35 AM

        I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (only about half of the memory channels populated, so not optimal).

        Type IQ2_XXS / 183GB, 16k context:

        CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.

        CPU + NVIDIA RTX 70GB VRAM: 4.74 t/s for PP and 1.87 t/s for response.

        I wish Unsloth produced a similar quantization for DeepSeek V3 - it would be more useful, as it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.

        • idonotknowwhy

          today at 8:53 AM

          Thanks a lot for the v2.5! I'll give that a whirl. Hopefully it's as coherent as v3.5 when quantized so small.

          > I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.

          I run the Q2_K_XL and it's perfectly good for me. Where it lacks vs FP8 is in creative writing. If you prompt it for a story a few times, then compare with FP8, you'll see what I mean.

          For coding, the 1.58-bit clearly makes more errors than the Q2_XXS and Q2_K_XL.

      • colorant

        today at 1:37 AM

        https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

        Requirements (>8 token/s):

        380GB CPU Memory

        1-8 ARC A770

        500GB Disk

          • GTP

            today at 9:31 AM

            > 1-8 ARC A770

            To get more than 8 t/s, is one Intel Arc A770 enough?

              • colorant

                today at 9:46 AM

                Yes, but the context length will be limited due to VRAM constraints.

            • colorant

              today at 1:41 AM

              Also see the demo from Jason Dai's post: https://www.linkedin.com/posts/jasondai_with-the-latest-ipex...

                • aurareturn

                  today at 3:49 AM

                  CPU inference is both bandwidth and compute constrained.

                  If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.

                    • colorant

                      today at 6:45 AM

                      Prompt length mainly impacts prefill latency (TTFT, time to first token), not the decoding speed (TPOT, time per output token).
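
                      As a rough illustration of how the two metrics combine (the numbers below are just placeholders, borrowing the CPU-only figures reported upthread):

                          # Rough end-to-end latency model: prefill is paid once per prompt,
                          # decode is paid once per generated token. Figures are placeholders.
                          prompt_tokens, output_tokens = 2000, 500
                          prefill_tps, decode_tps = 3.0, 1.44          # e.g. the CPU-only numbers upthread

                          ttft = prompt_tokens / prefill_tps           # time to first token, seconds
                          decode_time = output_tokens / decode_tps     # time spent generating
                          print(f"TTFT ~{ttft:.0f}s, decode ~{decode_time:.0f}s, total ~{ttft + decode_time:.0f}s")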

              • faizshah

                today at 3:06 AM

                Anyone got a rough estimate of the cost of this setup?

                I’m guessing it’s under 10k.

                I also didn’t see tokens per second numbers.

                  • ynniv

                    today at 3:12 AM

                    It better be! AMD @ $2k: https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...

                      • aurareturn

                        today at 3:49 AM

                        This article keeps getting posted but it runs a thinking model at 3-4 tokens/s. You might as well take a vacation if you ask it a question.

                        It’s a gimmick and not a real solution.

                          • miklosz

                            today at 5:42 AM

                            Exactly! I run it on my old T7910 Dell workstation (2x 2697A V4, 640GB RAM) that I built for way less than $1k. But so what? It's about ~2 tokens/s. Just like you said, it's cool that it runs at all, but that's it.

                            • hnuser123456

                              today at 4:17 AM

                              If you value local compute and don't need massive speed, that's still twice as fast as most people can type.

                                • aurareturn

                                  today at 5:33 AM

                                  Human typing speed is far slower than our eyes scanning for the correct answer.

                                  ChatGPT o3 mini high thinks at about 140 tokens/s by my estimation, and I sometimes wish it could return answers quicker.

                                  Getting a simple prompt answer would take 2-3 minutes using the AMD system and forget about longer context.

                              • walrus01

                                today at 4:25 AM

                                It's meant to be a test/development setup for people to prepare the software environment and tooling for running the same on more expensive hardware. Not to be fast.

                                  • aurareturn

                                    today at 5:34 AM

                                    I remember people trying to run the game Crysis using CPU rendering. They got it to run and move around. People did it for fun and the "cool" factor. But no one actually played the game that way.

                                    It's the same thing here. CPUs can run it but only as a gimmick.

                                      • refulgentis

                                        today at 5:42 AM

                                        > It's the same thing here. CPUs can run it but only as a gimmick.

                                        No, that's not true.

                                        I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM / memory bandwidth than compute.

                                        Crappy Pixel Fold 2022 mid-range Android CPU gets you roughly the same speed as the 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on.

                                        Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.

                                        The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"

                                        Additionally, the HN headline includes "1 or 2 Arc A770"

                                          • xoranth

                                            today at 7:56 AM

                                            > Crappy Pixel Fold 2022 mid-range Android CPU

                                            Can you share what LLMs you run on such small devices / what use cases they address?

                                            (Not a rhetorical question, it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)

                                            • aurareturn

                                              today at 5:46 AM

                                              It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.

                                              A770 has 16GB of RAM. You're shuffling data to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory bandwidth constrained.

                                              However, once you want to use it to do anything useful like a longer context size, the CPU compute will be a huge bottleneck for time-to-first-token as well as tokens/s.

                                              Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
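
                                              A back-of-the-envelope ceiling for decode speed, assuming each token streams its ~37B active parameters from wherever they live (the bit-width and bandwidth figures are illustrative assumptions; compute, KV cache and any expert caching are ignored):

                                                  # Bandwidth-bound ceiling: tokens/s <= bandwidth / bytes touched per token.
                                                  ACTIVE_PARAMS = 37e9        # active parameters per token (MoE)
                                                  BITS_PER_WEIGHT = 4.8       # rough average for a ~4-bit K-quant (assumption)
                                                  bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

                                                  for name, gbps in [("PCIe link to GPU", 64), ("8-ch DDR5 host RAM", 300), ("A770 VRAM", 560)]:
                                                      print(f"{name:>18}: <= {gbps * 1e9 / bytes_per_token:.1f} tok/s")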

                                                • refulgentis

                                                  today at 6:04 AM

                                                  Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

                                                  #1, I should highlight it up front this time: we are talking about _G_PUs :)

                                                  #2: You can't get a single consumer GPU with enough memory to load a 670B-parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.

                                                  TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to be loading a 670B-parameter model on only one or two cards

                                                    • aurareturn

                                                      today at 6:14 AM

                                                      1) This system mostly uses normal DDR RAM, not GPU VRAM.

                                                      2) M3 Ultra can load Deepseek R1 671B Q4.

                                                      Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.

                              • utopcell

                                today at 3:23 AM

                                What a teaser article! All this info for setting up the system, but no performance numbers.

                      • Gravityloss

                        today at 8:32 AM

                        I'm sure this question has been asked before, but why not launch a GPU with more but slower RAM? That would fit bigger models while still being affordable...

                          • fleischhauf

                            today at 9:31 AM

                            They absolutely can build GPUs with more VRAM, they just don't have the competition that would force them to. It's much more profitable this way.

                            • ChocolateGod

                              today at 8:39 AM

                              Because then you would have less motivation to buy the more expensive GPUs.

                                • antupis

                                  today at 9:26 AM

                                  Yeah, Nvidia doesn't have any incentive to do that, and AMD needs to get their shit together on the software side.

                            • jamesy0ung

                              today at 1:54 AM

                              What exactly does the Xeon do in this situation? Is there a reason you couldn't use any other x86 processor?

                                • VladVladikoff

                                  today at 1:58 AM

                                  I think it's that most non-Xeon motherboards don't have the memory channels to reach this much memory with any sort of commercially viable DIMMs.

                                    • genewitch

                                      today at 2:31 AM

                                      PCIe lanes

                                        • hedora

                                          today at 3:11 AM

                                          I was about to correct you because this doesn't use PCIe for anything, and then I realized Arc was a GPU (and they support up to 8 per machine).

                                          Any idea how many Arcs it takes to match an H100?

                                            • npodbielski

                                              today at 9:15 AM

                                              I read from time to time about multi-GPU setups, and the last time I found some real-life information about one (it was two 7900 XTXs), the result was that performance was the same at best, and often slower. So even if you manage to slap 8 cheap cards onto a motherboard, even if you somehow make it work (people have problems with such setups), and even if it runs continuously without many problems (crashes, power consumption), performance would be just OK. I am not sure spending 10k on such a setup would be better than buying a 10k card with 40GB of RAM.

                                                • pshirshov

                                                  today at 9:55 AM

                                                  Ollama works fine with multi-GPU setups. Since ROCm 6.3 everything is stable and you can mix different GPU generations. The performance is good enough for the models to be useful.

                                                  The only thing which doesn't work well is running on iGPUs. It might work but it's very unstable.

                                  • numpad0

                                    today at 7:22 AM

                                      DDR4 UDIMM is up to 32GB/module  
                                      DDR5 UDIMM is up to 64GB/module[0]  
                                      non-Xeon M/B has up to 4 UDIMM slots 
                                      -> non-Xeon is up to 128GB/256GB per node  
                                    
                                    Server motherboards have as many as 16 DIMM slots per socket with RDIMM/LRDIMM support, which allows more modules as well as higher capacity modules to be installed.

                                    [0]: there was a 128GB UDIMM launch at peak COVID
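
                                    The same capacity arithmetic as a quick sketch (slot counts and module sizes as listed above; the RDIMM line uses a typical high-capacity module, not a hard limit):

                                        # Max RAM = DIMM slots x module capacity (GB)
                                        configs = {
                                            "desktop, 4x DDR4 UDIMM (32GB)":  4 * 32,   # 128 GB
                                            "desktop, 4x DDR5 UDIMM (64GB)":  4 * 64,   # 256 GB
                                            "server, 16x DDR4 RDIMM (64GB)": 16 * 64,   # 1 TB per socket
                                        }
                                        for name, gb in configs.items():
                                            print(f"{name}: {gb} GB")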

                                    • walrus01

                                      today at 2:24 AM

                                      There's not much else (other than Epyc) in the way of affordably priced motherboards that have enough cumulative RAM. You can buy a used Dell dual socket older xeon CPU server with 512GB of RAM for test/development purposes for not very much money.

                                      Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.

                                      You also want the ability to run more than one card at full speed on at least a PCI-Express 3.0 x16 link, which means you need enough PCIe lanes, which you aren't going to find on a single-socket Intel workstation motherboard.

                                      Here's a couple of somewhat randomly chosen examples with 512GB of RAM, affordably priced. They'll be power hungry and noisy. The same general idea applies to other x86-64 hardware such as HP, Supermicro, etc. These are fairly common in quantity, so I'm using them as a baseline for specification vs price. Configurations will be something like 16 x 32GB DDR4 DIMMs.

                                      https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...

                                      https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...

                                      https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...

                                        • numpad0

                                          today at 7:32 AM

                                          PowerEdge R series is significantly cheaper if you already have ear protection

                                            • walrus01

                                              today at 8:50 AM

                                              Yes, an R730 or R740 for instance. There are lots of used R630s and R640s with 512GB of RAM as well, but a 1U server is not the best thing to try putting gaming-GPU-type PCI-Express video cards into.

                                  • notum

                                    today at 9:52 AM

                                    Censoring of token/s values in the sample output surely means this runs great!

                                    • mrbonner

                                      today at 4:47 AM

                                      I see there are a few options to run inference for LLMs and Stable Diffusion outside Nvidia: Intel Arc, Apple M-series, and now AMD Ryzen AI Max. It is obvious that running on Nvidia would be the most optimal way. But given the lack of high-VRAM Nvidia cards at a reasonable price, I can't stop thinking about getting one that is not Nvidia. So, if I'm not interested in training or fine-tuning, would any of those solutions actually work, on a Linux machine?

                                        • 999900000999

                                          today at 5:57 AM

                                          If you actually want to seriously do this, go with Nvidia.

                                          This article is basically Intel saying remember us, we made a GPU! And they make great budget cards, but the ecosystem is just so far behind.

                                          Honestly this is not something you can really do on a budget.

                                          • yongjik

                                            today at 3:02 AM

                                            Did DeepSeek learn how to name their models from OpenAI?

                                              • vlovich123

                                                today at 3:38 AM

                                                The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
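
                                                A toy parser for that naming pattern (the fields are community conventions rather than a formal spec, so treat this as illustrative only):

                                                    import re

                                                    # "<model name>-<param count>-<quant type>", e.g. DeepSeek-R1-671B-Q4_K_M
                                                    NAME_RE = re.compile(r"^(?P<model>.+)-(?P<params>\d+(?:\.\d+)?[BM])-(?P<quant>[A-Z0-9_]+)$")

                                                    m = NAME_RE.match("DeepSeek-R1-671B-Q4_K_M")
                                                    print(m.group("model"))   # DeepSeek-R1
                                                    print(m.group("params"))  # 671B   -> total parameter count
                                                    print(m.group("quant"))   # Q4_K_M -> ~4-bit K-quant, "medium" mix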

                                            • ryao

                                              today at 1:23 AM

                                              Where is the benchmark data?

                                              • zamadatix

                                                today at 1:31 AM

                                                Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift from using 0/1/2...8 Arc A770 GPUs.

                                                Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

                                                  • hmottestad

                                                    today at 2:27 AM

                                                    If you’re running just one GPU your context is limited to 1024 tokens, as far as I could tell. I couldn’t see what the context size is for more cards though.

                                                    • colorant

                                                      today at 1:39 AM

                                                      Yes, you are right. Unfortunately HN somehow truncated my original URL link.

                                                        • zamadatix

                                                          today at 1:40 AM

                                                          Sounds like submission "helper" tools are working about as well as normal :).

                                                          Did you have the chance to try this out yourself or did you just run across it recently?

                                                      • CamperBob2

                                                        today at 2:39 AM

                                                        Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)

                                                          • colorant

                                                            today at 2:58 AM

                                                            >8 TPS at this moment on a 2-socket 5th Gen Xeon (EMR)

                                                            • codetrotter

                                                              today at 3:00 AM

                                                              > the dual Epyc workstation recipe that was popularized recently

                                                              Anyone have a link to this one?

                                                        • anacrolix

                                                          today at 3:07 AM

                                                          Now we just need a model that can actually code

                                                            • ohgr

                                                              today at 8:11 AM

                                                              I'll settle for a much lower bar: an engineer who can tell that the code the model generates is shit.

                                                                • brokegrammer

                                                                  today at 8:35 AM

                                                                  Most engineers can do that, because it's way easier to find flaws in code you didn't write than in code you wrote yourself.

                                                                  My code is always perfect in my own eyes until someone else sees it.

                                                                    • ohgr

                                                                      today at 8:38 AM

                                                                      From experience, most engineers can do neither.

                                                          • chriscappuccio

                                                            today at 3:17 AM

                                                            Better to run the Q8 model on an Epyc pair with 768GB; you'll get the same performance

                                                              • ltbarcly3

                                                                today at 4:48 AM

                                                                The Q8 model is totally different?

                                                                  • manmal

                                                                    today at 8:29 AM

                                                                    My experience with quantizations is that anything below 6 bits is noticeably worse. Coherence suffers. I've rarely gotten anything really useful out of a Q4 model, code-wise. For transformations they are great though, e.g. converting JSON to Markdown and vice versa.

                                                                      • yieldcrv

                                                                        today at 9:44 AM

                                                                        I like Q5

                                                                        The sweet spot for me

                                                            • 7speter

                                                              today at 2:10 AM

                                                              I've been following the progress Intel Arc support in PyTorch is making, at least on Linux, and it seems like if things stay on track, we may see the first version of PyTorch with full Xe/Arc support by around June. I think I'm just going to wait until then instead of dealing with anything IPEX or OpenVINO.

                                                                • colorant

                                                                  today at 2:58 AM

                                                                  This is based on llama.cpp

                                                              • superkuh

                                                                today at 1:10 AM

                                                                No... this headline is incorrect. You can't do that. I think they've confused this with the performance of running one of the small distills onto existing smaller models. Two Arc cards cannot fit a 4-bit k-quant of a 671B model.

                                                                But a portable (no install) way to run llama.cpp on intel GPUs is really cool.
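
                                                                Rough size arithmetic behind that point (the bits-per-weight figure for a Q4 K-quant is an approximation; actual GGUF files vary a bit):

                                                                    params = 671e9
                                                                    bits_per_weight = 4.8                            # ~Q4_K_M average (assumption)
                                                                    model_gb = params * bits_per_weight / 8 / 1e9    # ~400 GB

                                                                    a770_vram_gb = 16
                                                                    for cards in (1, 2):
                                                                        share = cards * a770_vram_gb / model_gb
                                                                        print(f"{cards}x A770 = {cards * a770_vram_gb} GB VRAM, ~{share:.0%} of a ~{model_gb:.0f} GB model")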

                                                                  • Cheer2171

                                                                    today at 1:14 AM

                                                                    You don't have to go that far down the page to see it is paging to system RAM:

                                                                    Requirements:

                                                                        380GB CPU Memory
                                                                        1-8 ARC A770
                                                                        500GB Disk

                                                                      • superkuh

                                                                        today at 1:17 AM

                                                                        Yep. That's why the headline is incorrect: 380GB of the model in CPU system RAM and 32GB on some Arc GPUs. The ratio, 380/32, is obvious. Most of the processing is being done on the CPU. The GPUs are a little bit of icing in this context. Fast, sure, but they have to wait for the CPU layers (that's how layer splits work with llama.cpp).

                                                                        I think changing the end of headline to "Xeon w/380GB RAM" would stop it from being incorrect and misleading.

                                                                          • ryao

                                                                            today at 1:19 AM

                                                                            What if it does not need to read from system RAM for every token by reusing experts whenever they just happen to be in VRAM from being used for the previous token? If the selected experts do not change often, this is doable on paper.

                                                                              • hmottestad

                                                                                today at 2:34 AM

                                                                                That’s probably the main performance benefit of using the GPU. If you’re changing the active expert for every single token then it wouldn’t be any faster than just running it on the CPU. Once you can reuse the active expert for two tokens you’re already going to be a lot faster than just the CPU.

                                                                                More GPUs let you keep more experts active at a time.

                                                                                • hexaga

                                                                                  today at 2:35 AM

                                                                                  Expert distribution should be approximately random token-by-token, so not likely.

                                                                              • Cheer2171

                                                                                today at 1:18 AM

                                                                                "with" does not mean "entirely on"

                                                                                Edit: but what you added in your edit is right, it would be more accurate to append the system ram requirement

                                                                        • ryao

                                                                          today at 1:17 AM

                                                                          It is theoretically possible. Each token only needs 37B parameters and if the same experts are chosen often, it would behave closer to a 37B model than a 671B model, since reusing experts can skip loads from system RAM.

                                                                          You might still be right, since I have not confirmed that the selected experts change infrequently during prompt processing / token generation, and someone could have botched the headline. However, treating DeepSeek like Llama 3 when reasoning about VRAM requirements is not necessarily correct.

                                                                            • hmottestad

                                                                              today at 2:41 AM

                                                                              If the same expert is chosen for two consecutive tokens then it’ll act like a 37B model running on the GPU for the second token since it doesn’t need to load that expert from the main RAM again.

                                                                              • superkuh

                                                                                today at 1:26 AM

                                                                                MoE is pretty enabling after you've spent all the extra $$$$ to stuff your server CPU memory channels with RAM so it's possible to run at all. But that's still spending a lot of money, which makes this a lot less novel or interesting than "just on 1~2 Arc A770" implies. Especially for the marginal performance that even 8-12 channels of CPU memory bandwidth gets you.

                                                                                  • utopcell

                                                                                    today at 3:17 AM

                                                                                    Actually, 384GiB is already <$400 [1].

                                                                                    [1] https://www.amazon.com/NEMIX-RAM-DDR4-2666MHz-PC4-21300-Redu...

                                                                                      • superkuh

                                                                                        today at 4:43 AM

                                                                                        A low-end-DDR4-speed, older-generation Xeon system is unlikely to be what Intel used for this benchmark. It's far more likely they used an expensive modern DDR5 Xeon with as many memory channels as they could get. Single-user LLM inference is memory bandwidth bottlenecked. I just can't see Intel using old/deprecated hardware. And if someone other than Intel were to build a DDR4 Xeon system, it wouldn't reach the DDR5 tokens/s speeds reported here.

                                                                                        The reason they used a Xeon is memory channels. Non-server CPUs only have 2, but modern Xeons have 8 to 12 depending on generation/type. And the Xeons with the most channels are the most $$$$, so it ends up cheaper to just get a GPU or dedicated accelerator.
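
                                                                                        The channel arithmetic as a sketch (per-channel numbers are theoretical peaks for the listed transfer rates; sustained bandwidth is lower):

                                                                                            # Peak bandwidth per DDR channel = transfer rate (MT/s) x 8 bytes
                                                                                            def peak_gbps(mt_s: int, channels: int) -> float:
                                                                                                return mt_s * 8 * channels / 1000

                                                                                            print(peak_gbps(2666, 2))    # ~43 GB/s   desktop dual-channel DDR4-2666
                                                                                            print(peak_gbps(2666, 12))   # ~256 GB/s  2 sockets x 6 channels of DDR4-2666
                                                                                            print(peak_gbps(5600, 16))   # ~717 GB/s  2 sockets x 8 channels of DDR5-5600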

                                                                                    • utopcell

                                                                                      today at 3:14 AM

                                                                                      Is this amount of RAM really that expensive? 6x 64GiB DDR4 DIMMs are < $1,000.

                                                                                • rgbrgb

                                                                                  today at 1:13 AM

                                                                                  Yep, the title is inaccurate. It's a distill into Qwen 7B: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf

                                                                                    • zamadatix

                                                                                      today at 1:27 AM

                                                                                      The document contains multiple sections. The initial section does reference DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf as the example model but if you continue reading further you'll see a section referencing running DeepSeek-R1-Q4_K_M.gguf plus claims several other variations have been tested.

                                                                                      It's a bit less exciting when you see they're just talking about offloading parts from the large amount of DRAM.

                                                                                        • genewitch

                                                                                          today at 2:35 AM

                                                                                          So you thought there was some magical way to get >600B parameters in a couple of GPUs?

                                                                                            Also, LM Studio lets you run smaller models in front of larger ones, so I could see having a few GPUs in front really speeding up R1 inference.

                                                                                            • zamadatix

                                                                                              today at 2:56 AM

                                                                                              I had also initially assumed the title was supposed to reference something new about running a distilled variant as well. When I finished reading through and found out the news was just that you can also do this sort of "split" setup with Intel gear too it removed any further hope of excitement.

                                                                                              DeepSeek employs multi-token prediction which enables self-speculative decoding without needing to employ a separate draft model. Or at least that's what I understood the value of multi-token prediction to be.

                                                                                              • hmottestad

                                                                                                today at 2:44 AM

                                                                                                The MoE architecture allows you to keep the entire active model on a single GPU. If two consecutive tokens use the same expert then the second token is going to be much faster.

                                                                                                  • genewitch

                                                                                                    today at 3:36 AM

                                                                                                    I understand all that; I am talking about a separate feature, possibly backported from (or part of) llama.cpp, where you have a small model that runs first and is checked by the large model (speculative decoding). I've seen 30%+ speedups using, for example, a 1.5B in front of a 15B.

                                                                                                    Two GPUs or more mean you can start to "keep" one or more of the experts hot on a GPU as well.
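
                                                                                                    A minimal sketch of that draft-and-verify idea (greedy variant; in a real implementation the big model scores the whole proposed block in one batched pass, which is where the speedup comes from, and llama.cpp's actual API differs):

                                                                                                        from typing import Callable, List

                                                                                                        def speculative_decode(target: Callable[[List[str]], str],
                                                                                                                               draft: Callable[[List[str]], str],
                                                                                                                               prompt: List[str], max_new: int, block: int = 4) -> List[str]:
                                                                                                            out = list(prompt)
                                                                                                            while len(out) - len(prompt) < max_new:
                                                                                                                # 1) the small draft model proposes `block` tokens cheaply
                                                                                                                proposal: List[str] = []
                                                                                                                for _ in range(block):
                                                                                                                    proposal.append(draft(out + proposal))
                                                                                                                # 2) the big target model checks them; keep the agreeing prefix and
                                                                                                                #    emit the target's own token at the first disagreement
                                                                                                                kept: List[str] = []
                                                                                                                for tok in proposal:
                                                                                                                    want = target(out + kept)
                                                                                                                    kept.append(want)
                                                                                                                    if want != tok:
                                                                                                                        break
                                                                                                                out.extend(kept)
                                                                                                            return out[len(prompt):][:max_new]

                                                                                                        # Trivial stand-ins: both "models" continue the alphabet, but the draft slips after 'c'.
                                                                                                        target = lambda ctx: chr(ord(ctx[-1]) + 1)
                                                                                                        draft = lambda ctx: 'x' if ctx[-1] == 'c' else chr(ord(ctx[-1]) + 1)
                                                                                                        print(speculative_decode(target, draft, ['a'], max_new=6))   # ['b', 'c', 'd', 'e', 'f', 'g']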

                                                                                                    • utopcell

                                                                                                      today at 3:26 AM

                                                                                                      What is the probability of that happening?

                                                                                                        • zamadatix

                                                                                                          today at 4:13 AM

                                                                                                          DeepSeek V3/R1 uses 8 routed experts out of 256, so it doesn't happen as often as one would like. That said, having even just a single GPU will greatly speed up prompt processing, which is worth it even if the inference speed were the same.

                                                                                                          Ktransformers has a document about using CPU + a single 4090D to reach decent tokens/s but I'm not sure how much of the perf is due to the 4090D vs other optimizations/changes for the CPU side https://github.com/kvcache-ai/ktransformers/blob/main/doc/en... The final step of going to 6 experts instead of 8 feels like cheating (not a lossless optimization).
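
                                                                                                          A rough way to put numbers on that, assuming routed-expert choices are close to uniform and independent between tokens (the simplification suggested upthread), so cache hits follow a hypergeometric draw:

                                                                                                              from math import comb

                                                                                                              ROUTED, ACTIVE = 256, 8        # routed experts per MoE layer / chosen per token

                                                                                                              def expected_hits(cached: int) -> float:
                                                                                                                  # expected number of the 8 chosen experts already resident in VRAM
                                                                                                                  return ACTIVE * cached / ROUTED

                                                                                                              def p_all_hit(cached: int) -> float:
                                                                                                                  # probability that all 8 chosen experts are already resident
                                                                                                                  return comb(cached, ACTIVE) / comb(ROUTED, ACTIVE) if cached >= ACTIVE else 0.0

                                                                                                              for cached in (8, 64, 128):    # previous token's experts only, or larger caches
                                                                                                                  print(cached, round(expected_hits(cached), 2), f"{p_all_hit(cached):.1e}")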

                                                                                                            • genewitch

                                                                                                              today at 4:18 AM

                                                                                                              Where does 256 come from? It's repeated here and elsewhere that a single expert is 37B in size, so you'd have to have way more than "several hundred billion parameters" to hold 256 of those. Maybe I don't understand the architecture, but if that's the case, then everyone repeating 37B doesn't either.

                                                                                                                • zamadatix

                                                                                                                  today at 4:42 AM

                                                                                                                  I think this diagram from the DeepSeekMoE paper explains it the clearest: https://i.imgur.com/CRKttob.png The one on the right is how the feed forward layers of DeepSeek V3/R1 work, blue and green are experts, and everything in that right section is what counts as "active parameters".

                                                                                                                  K (K=8 for these models, but you can customize that if you want) experts out of 256 per layer are activated at a time. The 256 comes from the model file; it's just how many they chose to build it with. In these models there is also 1 shared expert which is always active in the layer. The router picks which K routed experts to use each forward pass and then a gating mechanism combines the outputs. If you sum the 1 shared expert + K routed experts per layer, plus the router, attention, and the other dense parts, you end up with roughly 37B parameters active per forward pass. The individual routed experts are therefore much smaller than the total (on the order of tens of millions of parameters each).

                                                                                                                  Or, for the short answer: "37 B is the active parameters of 9 experts + 'overhead', not the parameters of a single expert".
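
                                                                                                                  A quick sanity check of that breakdown, using approximate figures from the published V3 config (hidden size 7168, MoE intermediate size 2048, 256 routed + 1 shared expert per MoE layer, 8 routed active, 61 layers of which the first 3 use dense FFNs); treat the numbers as illustrative:

                                                                                                                      hidden, moe_inter = 7168, 2048
                                                                                                                      expert_params = 3 * hidden * moe_inter       # gate/up/down projections per expert
                                                                                                                      moe_layers = 61 - 3                          # the first 3 layers use a dense FFN instead

                                                                                                                      active_ffn = moe_layers * (8 + 1) * expert_params    # 8 routed + 1 shared per MoE layer
                                                                                                                      total_routed = moe_layers * 256 * expert_params      # every routed expert in the file

                                                                                                                      print(f"one routed expert  ~{expert_params / 1e6:.0f}M params")
                                                                                                                      print(f"active MoE FFN     ~{active_ffn / 1e9:.1f}B of the ~37B active (rest: attention, dense layers, embeddings)")
                                                                                                                      print(f"all routed experts ~{total_routed / 1e9:.0f}B of the ~671B total")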