
HipKittens: Fast and furious AMD kernels

199 points - yesterday at 2:27 AM

  • georgehotz

    today at 1:23 AM

    Full disclosure, we have a contract with AMD to get Llama 405B training on MI350X on MLPerf.

    Things are turning around for AMD. If you have an AMD card, go to pytorch.org, click Linux+ROCm and install PyTorch. 3 years ago, this was hopeless. Today, most mainline things work. I ran nanochat on MI300X and it just worked. I think that's true about MI350X now too. The MI350X machine is stable.

    They are clearly behind NVIDIA, nobody doubts that. And a lot of investment in software will be required to catch up: ecosystem, compiler, and driver. But 2 years ago they seemed hopeless, now they don't. Things take time. HipKittens is a great codebase to study to see where AMD's LLVM backend is still lacking; compare it to the CUDA Kittens.

    For training, it's NVIDIA and Google in first. AMD in second. And nobody in third. Intel and Tenstorrent are not remotely close. Huawei examples segfaulted. Groq gave up selling chips. Cerebras isn't available anywhere. Trainium had a 5 day wait time to get one instance and I lost interest.

      • WithinReason

        today at 9:43 AM

        How far is Tinygrad from being able to represent/search the kind of optimisations listed in the article? i.e.:

          1. data layouts to avoid local memory bank conflicts
          2. read patterns from global memory to optimize L2 cache reuse
          3. warp specialisation
        
        How complex is it to add these into tinygrad?
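
        (To make item 1 concrete, a minimal sketch in plain Python of why kernels pad a shared-memory tile's row stride. It assumes the common 32-bank, 4-byte-word layout; the numbers are illustrative and not taken from the article.)

          # 32 banks of 4-byte words, as on most current GPUs (illustrative assumption).
          NUM_BANKS = 32

          def banks_hit(rows, stride, col):
              """Banks touched when `rows` threads each read tile[row][col]."""
              return {(row * stride + col) % NUM_BANKS for row in range(rows)}

          # Unpadded 32x32 float tile: a column read maps every thread to the same bank.
          print(len(banks_hit(32, stride=32, col=0)))   # -> 1 (fully serialized)

          # Pad each row by one word (stride 33): the column spreads across all 32 banks.
          print(len(banks_hit(32, stride=33, col=0)))   # -> 32 (conflict-free)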

        • fulafel

          today at 5:24 AM

          Does consumer hardware (non-MI) need proprietary kernel drivers for running rocm + pytorch?

            • georgehotz

              today at 8:07 AM

              Nope! Works fine with a somewhat recent in-tree kernel. The AMD driver is actually open source, not just a wrapper around a big on-device blob like the NVIDIA one. tinygrad also has a driver that doesn't even need the kernel module, just mmapping the PCIe BAR into Python.
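
              (For the curious, that user-space approach looks roughly like the sketch below. This is not tinygrad's actual driver code, just a minimal illustration; the PCI address is a placeholder and you'd normally need root.)

                import mmap, os

                # Placeholder PCI address; substitute your GPU's device.
                BAR0 = "/sys/bus/pci/devices/0000:03:00.0/resource0"

                fd = os.open(BAR0, os.O_RDWR | os.O_SYNC)
                size = os.fstat(fd).st_size
                bar = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

                # The BAR now reads like ordinary memory: bar[off:off+4] is one 32-bit register.
                print(f"mapped {size:#x} bytes of BAR0")
                bar.close(); os.close(fd)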

          • latchkey

            today at 2:40 AM

            As the CEO of an AMD NeoCloud for the past 2 years, it is so nice to hear all this and also see the turnaround. It is what I bet my business on from the start, and I concur 100% with what George is saying.

            The out of box experience can be a bit rough around the edges on bleeding edge stuff, but it isn't anything near as bad as it used to be. For example, a month ago nanochat wasn't working well and now it is. The important thing is that people now care enough to make it work.

            At the end of the day, AI does need viable options. Having a monopoly on all AI hardware and software might be a good thing for shareholders, but isn't a good thing for what is looking like a fundamental technology, akin to the internet.

              • ivape

                today at 5:57 AM

                That’s interesting; I was specifically looking for AMD hardware offered by neoclouds, and they seem to be rare.

                I like your bet though. For decades there has never really been a difference between NVDA and AMD at the hardware level. AMD has always been on par, and software is software; it will catch up.

                AMD will be a stock many people will miss because the opportunity has presented itself at the height of AI bubble talk, and this will leave many in the dust. A doubling or tripling of their market cap is pretty much a foregone conclusion.

                  • latchkey

                    today at 7:52 AM

                    You're right, it is a much smaller ecosystem, but I think that is partly intentional as a way to focus efforts and not feed into the bubble, which I feel is a smart move. These are the official partners [0]. I'm Hot Aisle.

                    George was very smart, $500k in the $90s. I saw it coming even earlier than him, but that's because I was already aware the hardware was good from my own experiences.

                    [0] https://www.amd.com/en/products/accelerators/instinct/eval-r...

        • bratao

          yesterday at 10:38 PM

          One thing I don't understand about Nvidia’s valuation: right now a small number of algorithms have 'won,' such as Transformers, and the data is what matters most. In the past, when customized code was much more common (modeling code, HPC), the ecosystem was very important and it was almost impossible to reimplement all of CUDA and the related code.

          Competitors now only need to optimize for a narrow set of algorithms. If a vendor can run vLLM and Transformers efficiently, a massive market becomes available. Consequently, companies like AMD or Huawei should be able to catch up easily. What, then, is Nvidia’s moat? Is InfiniBand enough?

            • vagab0nd

              today at 2:04 PM

              If your competitor has a 5-year lead, and is working as hard as you are, or harder, then you are not gonna catch up any time soon. Also yes networking.

              • jillesvangurp

                today at 8:26 AM

                You are right to question their moat. My view on this is that there's a lot of pressure from essentially all the other trillion-dollar companies (MS, Google, Amazon, Apple, etc.) to not get locked into an Nvidia-only ecosystem. Each of those does their own chips. They also use Nvidia, but not exclusively. An Android or iOS phone has no Nvidia chips whatsoever. Neither do most laptops. Apple's M series chips don't support CUDA at all. And with the exception of some gaming or workstation class laptops, most Windows/Linux laptops come with either AMD or Intel GPUs, or lately Qualcomm ARM-based architectures with custom GPUs.

                Nvidia's valuation and moat are centered around data-center-class GPUs used for training. I don't think they'll have that space effectively to themselves for much longer. Google is already using its own TPUs at scale for both training and inference. They still use some Nvidia hardware, but they seem able to keep it off the critical path for anything that needs to run at "Google scale". OpenAI just ordered a bunch of AMD hardware. A lot of AI engineers use Apple laptops that rely on M series hardware.

                In short, the CUDA moat is shrinking. It's still relevant, of course, and a lot of tooling and frameworks depend on it. That's why everybody still uses it. But not exclusively. And there's a lot of extremely well funded and active development aimed at cutting loose from it. AMD of course wants in. So does Intel. And so does everybody else. This HipKittens work looks like it takes some big steps towards a more vendor-neutral software ecosystem.

                • wmf

                  yesterday at 10:48 PM

                  Infiniband is being replaced with UEC (and it isn't needed for inference). For inference there is no moat and smart players are buying/renting AMD or Google TPUs.

                    • patagurbon

                      today at 12:37 AM

                      Do you have evidence for this? I don’t think Nvidia is switching to Ultra Ethernet, just adding it to the product line-up

                        • wmf

                          today at 1:40 AM

                          Sorry, I don't mean Nvidia is adopting UEC (they probably hate it). I should have said UEC can substitute for Infiniband.

                      • mandelken

                        yesterday at 10:57 PM

                        I didn't know you can buy Google TPUs now?

                  • LtdJorge

                    yesterday at 10:56 PM

                    The vast number of CUDA libraries for anything you can think of. I think that's where they have the biggest leverage.

                      • observationist

                        yesterday at 11:43 PM

                        AI is going to be so ubiquitous, something principled and open is going to supersede CUDA at some point, as HTML5 did for Flash. CUDA isn't like an x86 vs ARM situation where they can use hardware dominance for decades, it's a higher-level language, and being compatible with a wide range of systems benefits NVIDIA and their competitors. They're riding out their relative superiority for now, but we're going to see a standards and interoperability correction sometime soon, imo. NVIDIA will drive it, and it will gain them a few more years of dominance, but afaik nothing in their hardware IP means CUDA compatibility sacrifices performance or efficiency. They're also going to want to compete in the Chinese market, so being flexible about interoperability with their systems gains them a bit of market access that might otherwise be lost.

                        There's a ton of pressure on the market to decouple nvidia's proprietary software from literally everything important to AI, and they will either gracefully transition and control it, or it will reach a breaking point and someone else will do it for (and to) them. I'm sure they've got finance nerds and quants informing and minmaxing their strategy, so they probably know to the quarter when they'll pivot and launch their FOSS, industry leading standards narrative (or whatever the strategy is.)

                          • toasterlovin

                            today at 5:22 AM

                            > as HTML5 did for Flash

                            Uh, Flash died because Apple refused to support it on mobile Safari. Perhaps Flash would have died anyway, but that is the proximate cause. And Apple's competitors were falling over themselves to market Flash support as a competitive advantage vs. iPhone.

                            • bigyabai

                              today at 5:39 AM

                              > but we're going to see a standards and interoperability correction sometime soon, imo.

                              I thought this too, in 2015. OpenCL looked really promising, but Apple bailed and neither AMD nor Intel had the funding to keep up with Nvidia's research. It sorta floundered, even though Nvidia GPUs smugly ran OpenCL code with benchmark-leading performance.

                              Nvidia won the datacenter because of hardware. You could release a perfect CUDA-to-Vulkan translator tomorrow, and they still wouldn't be dethroned until better hardware replaced it. Intel is swirling the drain, Qualcomm is hedging their bets on mobile, AMD is (still) too underfunded - Apple is the only company with the design chops and TSMC inroads to be a serious threat, and they can't release a datacenter product to save their life. It's understandable why people think Nvidia is a monopoly; Team Green is pulling a full-on "Luigi wins by doing nothing" in 2025: https://knowyourmeme.com/memes/luigi-wins-by-doing-absolutel...

                              The market has almost no pressure to decouple from Nvidia - nobody else has mature solutions. It requires a preestablished player to make a similarly risky play, which might rule out everyone who's sitting at the table.

                          • bryanlarsen

                            yesterday at 11:35 PM

                            To rephrase the OP's point: transformers et al. are worth trillions. All the other CUDA uses are worth tens or hundreds of billions. They've totally got those locked up, but researchers are a smaller market than video games.

                        • ehnto

                          today at 7:30 AM

                          They also don't actually have a moat in the sense that they have patented technology keeping others out of the game. The other chip makers are coming for their lunch eventually.

                          • mountainriver

                            today at 2:17 AM

                            Transformers aren’t really one thing; the way they are implemented is wildly different. If that weren’t the case, vLLM and TRL would be easy.

                            • ekropotin

                              today at 5:48 AM

                              It’s all about the deeply entrenched ecosystem NVIDIA has been building around CUDA for decades. It’d be super hard to replicate this hardware-software platform.

                              Plus strategic partnerships with cloud providers.

                              And InfiniBand, yes.

                              • ivape

                                today at 6:03 AM

                                I don’t think NVDA will have anything like a real moat; it’s more like whatever the difference was between iOS and Android. The gist of it is, the big bang of AI has happened and that universe is rapidly expanding, just like it once did for smartphones. There is the Apple of AI, which is NVDA, and then there is Android (AMD). Moats are irrelevant here because the universe has just started rapidly expanding for them.

                                Apple didn’t really “win” out against Android, and it would be a very wrong way of measuring what actually happened. Yet, Apple could have been seen as more premium during various points of that timeline. The truth of the matter was, it was never a swimming race at any point in that smartphone timeline. It was simply a flood that you could convince yourself was an orderly race.

                                I believe the same is happening now, and it’s in Nvidia’s interest to maintain the narrative that there is a race and they are winning it. Believing something like this during the smartphone era would have been foolish.

                                • o11c

                                  today at 12:12 AM

                                  The thing the "just optimize AI" crowd misses is that this isn't like optimizing a programming language implementation, where even the worst implementation is likely only 100x slower than a good implementation.

                                  AI is millions of times slower than optimal algorithms for most things.

                              • wewewedxfgdf

                                yesterday at 10:49 PM

                                You'd think AMD would swing in on something like this and fund it with the money needed to succeed. I have no knowledge of it but my guess is no, AMD never misses an opportunity to miss an opportunity - when it comes to GPUs and AI.

                                  • AMDAnon

                                    yesterday at 11:42 PM

                                    AMD pays the bare minimum in software to get a product out the door. The company does not even have working performance testing and regressions routinely get shipped to customers. Benchmarks the executives see are ad hoc and not meaningful.

                                    HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.

                                    This isn't fixable overnight. Company-wide DevOps and infrastructure is outsourced to TCS in India who have no idea what they're doing. Teams with good leadership maintain their own shadow IT teams. ROCm didn't have such a team until hyperscalers lost their shit over our visibly poor development practices.

                                    Even if AMD did extend an offer to hire all the people in the article, it would be below-market since the company benchmarks against Qualcomm, Broadcom, and Walmart, instead of Google, Nvidia, or Meta.

                                    We haven't had a fully funded bonus in the past 4+ years.

                                      • schainks

                                        today at 8:10 AM

                                        > We haven't had a fully funded bonus in the past 4+ years.

                                        This is WILD to hear considering how well it appears AMD is executing from the outside.

                                        • JonChesterfield

                                          today at 3:32 AM

                                          This doesn't sound right. I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

                                          They've paid serious amounts in RSUs over the last six years. Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs. Bonus might be team dependent, I remember being annoyed and nicely surprised by it in different years.

                                          The aql profiler confuses me quite a lot but it's definitely a tool for measuring performance.

                                            • slavik81

                                              today at 6:54 AM

                                              I don't think anon is correct, but I can understand how they'd come to their conclusions. I certainly didn't choose AMD to maximize my pay, though it's always been a comfortable salary.

                                              With regards to performance, there are some things tracked carefully and other things that are not tracked at all. I suspect that is why some folks think we're really good at it and others think we're terrible. There's lots of room for improvement, though. Excitement over trivial performance regressions is more a sign of immaturity than of good tracking.

                                              • AMDAnon

                                                today at 5:35 AM

                                                > I definitely got yelled at over trivial performance regressions which looked like noise so people were measuring performance.

                                                It depends on team, we have some testing, and progress is being made. But it's not "working" or comprehensive as we get complaints from our big customers. We should be replicating their setup internally and not have them catch problems.

                                                > Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs.

                                                We need to pay top of market to steal people from our competitors. We can't pay less than Nvidia and outcompete them. Paying less is a signal we're aiming for second and to copy the market leader.

                                            • observationist

                                              yesterday at 11:49 PM

                                              The MBAs are in charge, and now AMD is the new Intel?

                                              It's not only not fixable overnight, but it's not fixable at all if the leadership thinks they can coast on simply being not as bad as Intel, and Intel has a helluva lot of inertia and ability to simply sell OEM units on autopilot.

                                              Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

                                                • AMDAnon

                                                  today at 12:03 AM

                                                  The MBAs have always been in charge to an extent.

                                                  But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.

                                                  The mindset is that we maintain a comfortable second place by creating a shittier but cheaper product. That is how AMD has operated since 1969 as a second source to Fairchild Semiconductor and Intel. It's going to remain the strategy of the company indefinitely with Nvidia. Attempting to become better would cost too much.

                                                  > Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.

                                                  Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.

                                                  What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers. Or phase in the same over a longer period of time. The company is full of people that do nothing because we've paid under market for so long. That's fine when competing against Intel, it's not acceptable when competing against Microsoft, Amazon, OpenAI, Google, and Nvidia.

                                                  Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.

                                                    • Aurornis

                                                      today at 2:14 AM

                                                      > What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers.

                                                      I was part of a company with a similar problem. If AMD’s situation is similar to what I dealt with, it’s more complicated. When you start doing deep-cut layoffs at the IC level combined with expectations of big salary increases for those who remain, the office politics escalate to a level I didn’t know was possible.

                                                      All of those people who do nothing find a way to join forces with those people who are showing those inflated benchmarks to execs and before you know it the layoffs are about as accurate as random chance when it comes to cutting the dead weight from the company.

                                                      In my experience, the change needs to start closer to the top: Upper layers of management need to be shaken up. Middle management audited by new upper management hires who have fresh eyes and aren’t afraid to make honest evaluations. High performing teams who are stuck under management hell need to be identified and rotated into other projects that are critical for the company but have become occupied by fiefdom-building managers. Hiring needs to ramp up to bring in new talent that was previously priced out by the low comp.

                                                      It’s hard. I wish there was an easy way to cut the low performers, but they have an amazing way of teaming up with the bad managers. Maybe because they have so much free time to do office politics because they’re not doing much work.

                                                        • FuckButtons

                                                          today at 6:02 AM

                                                          "Maybe because they have so much free time to do office politics because they’re not doing much work.”

                                                          I mean, isn’t that always the way? Honestly, I feel like you could do a lot worse than just firing most of the people who demonstrate above average social skills. Sure, some would be fired unnecessarily, but I can’t think of any engineers that have seemed almost pathologically shy that also didn’t want to work hard.

                                                      • sho

                                                        today at 6:17 AM

                                                        Came into this thread hoping for good news about GPUs and instead there's some surprisingly thoughtful management discussion!

                                                        > What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers.

                                                        Tell me you're an engineer without telling me you're an engineer. The problem is they don't know which half and they can't know. It's an issue of legibility and transparency - put yourself into the shoes of the C-suite. You're staring down a complete black box of, what, 5,000 people. How can you possibly know who's good and who's not? Think of the information they have at hand - what the chain of command tells them. What if the chain of command itself is the problem? Think about how you yourself could protect a bad employee if you were a manager. You could! How can they possibly find the truth?

                                                        People rightly hate stack ranking, but you can see why ideas like that exist - attempts to come up with organizational pruning algorithms that are resistant to the managers themselves being the problem.

                                                        And this is also why CEOs incoming with a turnaround mission often do a clean sweep and stack the c-suite with all their friends. Not because they're giving jobs to their mates - although sure, that does happen - but because they're trying to establish at least a single layer of trust, which can then in time hopefully be extended downwards. But it all takes time, and for some organizations, they never do manage it. When unlimited orgs all compete for the same limited number of good managers - well, some of them are going to lose.

                                                        Ironically I'm bullish on AI being able to greatly help with all of this. Maybe running on AMD GPUs...

                                                        • latchkey

                                                          today at 2:42 AM

                                                          Get rid of half the lawyers who are sitting in the way most of the time and take the risks necessary to move closer to the top.

                                                  • BNE

                                                    today at 12:48 AM

                                                    > Teams with good leadership maintain their own shadow IT teams.

                                                    Yes, this is true. Painfully true.

                                                    • FuckButtons

                                                      today at 5:55 AM

                                                      Madness. I see the accountants are in charge then.

                                                  • 0manrho

                                                    today at 12:24 AM

                                                    > AMD never misses an opportunity to miss an opportunity

                                                    Well said, their Instinct parts are actually, at a hardware level, very very capable pieces of kit that - ignoring software/dev ecosystem - are very competitive with NVidia.

                                                    Problem is, AMD has a terrible history of supporting its hardware (either just outright lack of support, cough Radeon VII; or constantly scrapping things and starting over and thus the ecosystem never matured) and is at a massive deficit behind the CUDA ecosystem, meaning that a lot of that hardware's potential is squandered by the lack of compatibility with CUDA and/or the lack of investment in a comparable alternative. Those factors have given NVidia the momentum it has, because most orgs/devs will look at the support/ecosystem delta and ask themselves why they'd expend the resources reinventing the CUDA wheel to leverage AMD hardware when they can just spend that money/time investing in CUDA and NVidia instead.

                                                    To their credit, it seems AMD has learned its lesson: they're actually trying to invest in ROCm and their Instinct ecosystem and seem to be sticking to their guns on it, and we're starting to see people pick it up, but they're still far behind Nvidia and CUDA.

                                                    One key area that Nvidia is far ahead of AMD on in the hardware space is networking.

                                                      • AMDAnon

                                                        today at 12:50 AM

                                                        > constantly scrapping things and starting over and thus the ecosystem never matured

                                                        AMD hires talented people at below-market and doesn't promote them or give raises. This causes employees to aim at resume-driven development by reinventing the wheel so they can get a job somewhere else.

                                                        It's a similar problem to Google, except at Google it's because promotions are explicitly for people that ship new products.

                                                        • BNE

                                                          today at 12:54 AM

                                                          Our hardware is arguably better (spec for spec) apart from critical areas like memory bandwidth and GPU-to-GPU bandwidth. You can tweak your implementations to get the same if not better performance. We do that, we see this, our customers see this.

                                                          ROCm, pre The Rock, suffers from ossification in the engineering organization. The Rock seeks to completely change that, and the team driving it is amazing. Try out the pre-alpha installer. It is already better than the default installer.

                                                          There is hope.

                                                            • 0manrho

                                                              today at 3:21 AM

                                                              > There is hope.

                                                              Indeed. For clarity, I agree the performance is certainly there. My comment about being behind was in the context of marketshare and ecosystem maturity compared to CUDA. In fact, I'd say there's more than just hope but actual meaningful progress and commitment being made there, and I'm happy to see it.

                                                        • elteto

                                                          yesterday at 11:04 PM

                                                          From the performance comparison table, basically AMD could be NVIDIA right now, but they aren’t because… software?

                                                          That’s a complete institutional and leadership failure.

                                                          Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.

                                                          It goes to show that some companies just don’t “get” software. Not even AMD!

                                                            • bryanlarsen

                                                              yesterday at 11:39 PM

                                                              CUDA was started in 2004. AMD was basically broke until they hit a home run with Ryzen in 2017.

                                                              • wmobit

                                                                yesterday at 11:36 PM

                                                                I'd go so far as to say it's the exact opposite. It's faster and easier to change the hardware than the software.

                                                                  • elteto

                                                                    today at 2:39 AM

                                                                    Counterproof: attempt to modify your graphics card. Then attempt to modify a piece of code. Which one was easier?

                                                                      • Mehvix

                                                                        today at 7:03 AM

                                                                        You're saying it like hardware and software are disjoint. You design hardware with software in mind (and vice versa); you need to if you want performance rivaling Nvidia's. This codesign, making sure their products are not only usable but actually tailored to maximize resource utilization in real workloads (not driven by whatever benchmarks), is where AMD seems to be lacking.

                                                                        Why oversimplify the premise and frame your take as some 'proof'? Just use the term counter-argument/example.

                                                            • suprjami

                                                              today at 1:44 AM

                                                              AMD have had people contribute optimised ROCm kernels in the past. They closed the PR without merge. ROCm are not interested in this. Baffling behaviour.

                                                              • wmf

                                                                yesterday at 11:25 PM

                                                                It is now funded and working.

                                                                • LtdJorge

                                                                  yesterday at 10:55 PM

                                                                  First rule of AMD stock is nobody understands AMD stock. I guess it’s also the same for AMD’s software endeavors.

                                                              • semessier

                                                                today at 6:19 AM

                                                                 Without having implemented inference myself, just looking at it from a math perspective, this is basic linear algebra/BLAS. I am very much wondering what a lean, inference-optimized API covering 80% of all use cases across dtypes and sparsity would look like. Probably a far cry from what's in CUDA, and probably all that's needed for practical inference.
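
                                                                 (A rough numpy sketch of that point, purely illustrative and not a proposed API: a single attention block reduces to a handful of GEMMs plus elementwise ops, which is most of what such an interface would have to cover. Shapes and names are made up.)

                                                                   import numpy as np

                                                                   def softmax(x):                    # elementwise, cheap
                                                                       e = np.exp(x - x.max(-1, keepdims=True))
                                                                       return e / e.sum(-1, keepdims=True)

                                                                   def attention(x, wq, wk, wv, wo):  # five GEMMs plus a softmax
                                                                       q, k, v = x @ wq, x @ wk, x @ wv
                                                                       return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ wo

                                                                   d = 64
                                                                   x = np.random.randn(8, d)
                                                                   wq, wk, wv, wo = (np.random.randn(d, d) for _ in range(4))
                                                                   print((x + attention(x, wq, wk, wv, wo)).shape)  # (8, 64)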

                                                                • 999900000999

                                                                  today at 7:13 AM

                                                                    With these new developments, are there any implications for getting LLMs running well on consumer AMD chips?

                                                                  For example, the following laptop which I'm thinking of picking up, has both a strong AMD CPU/IGPU and a RTX 5080. Could we see the AMD side competing with the RTX?

                                                                    I know a dedicated GPU will always be faster, though.

                                                                  >HP OMEN MAX 16-ak0003nr 16" Gaming Laptop Computer - Shadow Black Aluminum AMD Ryzen AI 9 HX 375 (2.0GHz) Processor; NVIDIA GeForce RTX 5080 16GB GDDR7; 32GB DDR5-5600 RAM; 1TB Solid State Drive

                                                                    • ehnto

                                                                      today at 7:29 AM

                                                                      I run Qwen3 Coder 30B through Ollama on an RX 7900 XTX. It works great; I suspect some load gets passed to the 32GB of system memory and the Ryzen 7 CPU.

                                                                      It's not quite as fast as like Sonnet 4 from an API, but it's really not that bad.

                                                                      It's really great for quick questions so I don't have to google stuff, and it's probably Sonnet4 level of competency at achieving coding tasks.

                                                                      No API served model has been fast enough to remove the urge to do something else while waiting for bigger tasks, so the UX is more or less the same in that regard.

                                                                      Opencode + ollama + Qwen3 Coder has been a very reasonable alternative to ClaudeCode with Sonnet4.

                                                                      That is amazing for something running locally.

                                                                      It is possible that if you actually need AI to be doing all your coding, you're going to feel differently about the setup. But as a small assistant it's great.

                                                                        • christkv

                                                                          today at 1:46 PM

                                                                          That's great. I have been eyeing a Strix Halo and was wondering how well smaller models are doing. This is great news from the perspective of running local agents.

                                                                          • electroglyph

                                                                            today at 8:19 AM

                                                                            not the best model to use as a showcase, it's blistering fast on anything that isn't a toaster

                                                                              • ehnto

                                                                                today at 10:39 AM

                                                                                Great! That's what I am pointing out, it's a 30b param model that fits into an AMD card and runs great. That's what we want.

                                                                    • LtdJorge

                                                                      yesterday at 10:53 PM

                                                                      Ahh, composable-kernel. The biggest offender on the list of software that has produced unrecoverable OOMs on my Gentoo system (it’s actually Clang while compiling CK, which uses upwards of 2.5GB per thread).

                                                                      • villgax

                                                                        today at 1:52 AM

                                                                        Totally ignored B300 for some reason