Tiled Hacker news on React Router

DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

297 points - 03/06/2025

Source

hnfong
03/06/2025
As other commenters have mentioned, the performance of this set up is probably not really great since there's not enough VRAM and lots of bits have to be moved between CPU and GPU RAM.
That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic
I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this is recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...
DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.
colorant
03/06/2025
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
jamesy0ung
03/06/2025
What exactly does the Xeon do in this situation, is there a reason you couldn't use any other x86 processor?
Gravityloss
03/06/2025
I'm sure this question has been asked before, but why not launch a GPU with more but slower ram? That would fit bigger models while still affordable...
yongjik
03/06/2025
Did DeepSeek learn how to name their models from OpenAI.
CamperBob2
03/06/2025
Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)
notum
03/06/2025
Censoring of token/s values in the sample output surely means this runs great!
03/06/2025
mrbonner
03/06/2025
I see there are a few options to run inference for LLM and Stable Diffusion outside Nvidia. There is Intel Arc, Apple Ms and now AMD Ryzen AI Max. It is obvious that running in Nvidia would be the most optimal way. But given the availability of high VRAM Nvidia cards at reasonable price, I can't stop thinking about getting one that is not Nvidia. So, if I'm not interested in training or fine tuning, would any of those solutions actually works? On a Linux machine?
andrewstuart
03/06/2025
With the arrival of APUs for AI everyone is going to lose interest in GPUs real fast.
Why buy an overpriced Nvidia 4090 when you can get an AMD Halo Strix or Apple M3 Studio APU with 512GB or 128GB of Ram?
Nvidia has kept prices high and performance low for as long as it can and finally competition is here.
Even Intel can make APUs with tons of RAM.
Nvidia hopefully is squirming.
7speter
03/06/2025
I’ve been following the progress Intel Arc support in Pytorch is making, at least in Linux, and it seems like if things stay on track, we may see the first version of pytorch with full Xe/Arc support by around June. I think I’m just going to wait until then instead of dealing with anything ipex or openvino.
03/06/2025
03/06/2025
ryao
03/06/2025
Where is the benchmark data?
DeathArrow
03/06/2025
Any chance of using a couple of 3090 with Deepseek and fit the whole thing in the video RAM? I'm thinking to something like a software or "fake" NVLink.
03/06/2025
zamadatix
03/06/2025
Since the Xeon alone could run the model in this set up it'd be more interesting if they compared the performance uplift with using 0/1/2..8 Arc A770 GPUs.
Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...
chriscappuccio
03/06/2025
Better to run the Q8 model on an epyc pair with 768GB, you'll get the same performance
03/06/2025
anacrolix
03/06/2025
Now we just need a model that can actually code
03/06/2025
superkuh
03/06/2025
No... this headline is incorrect. You can't do that. I think they've confused the performance of running one of the small distills to existing smaller models. Two Arc cards cannot fit a 4 bit k-quant of a 671b model.
But a portable (no install) way to run llama.cpp on intel GPUs is really cool.

DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

hnfong

SlavikCA

idonotknowwhy

colorant

pinoy420

colorant

colorant

aurareturn

colorant

moffkalast

colorant

GTP

colorant

faizshah

ynniv

aurareturn

hnuser123456

aurareturn

evilduck

miklosz

walrus01

aurareturn

refulgentis

aurareturn

refulgentis

aurareturn

xoranth

refulgentis

utopcell

yvdriess

utopcell

jamesy0ung

VladVladikoff

genewitch

hedora

npodbielski

pshirshov

npodbielski

zargon

genewitch

numpad0

walrus01

numpad0

walrus01

genewitch

walrus01

genewitch

numpad0

genewitch

numpad0

genewitch

Gravityloss

TeMPOraL

ChocolateGod

antupis

cheschire

fleischhauf

andrewstuart

coolspot

Gravityloss

varelse

yongjik

vlovich123

CamperBob2

colorant

codetrotter

hedora

notum

mrbonner

999900000999

andrewstuart

7speter

colorant

ryao

DeathArrow

zamadatix

hmottestad

colorant

zamadatix