Lossless LLM compression for efficient GPU inference via dynamic-length float

371 points - yesterday at 6:20 PM

Source
  • jhj

    yesterday at 8:45 PM

    This is just a consequence of the fact that bfloat16 has a very high dynamic range which is not all used. People like hyperparameters that look like 0.01 not 10^10, even though there is the same fractional precision available at each exponent and if you multiplied everything - hyperparameters, initialized weights, training data, etc in a network by 10^6 things will still work more or less the same since the upper range is hardly used (with the possible exception of some small number of special functions).

    Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the 16 bits carry information in practice). Sign and mantissa bits tend to be incompressible noise.
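
As a rough illustration of the entropy claim above, here is a small Python sketch that splits synthetic bfloat16 weights into sign/exponent/mantissa fields and measures the empirical entropy of each. The normal(0, 0.02) weight distribution and all names are assumptions for the example, not something from the thread.

```python
import numpy as np

# Synthetic "weights": trained weights are typically roughly zero-mean with a small std
# (the 0.02 here is just an illustrative assumption).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

# bfloat16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits.
bits = w.view(np.uint32) >> 16
sign     = (bits >> 15) & 0x1
exponent = (bits >> 7) & 0xFF
mantissa = bits & 0x7F

def entropy_bits(symbols):
    # Empirical Shannon entropy in bits per symbol.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print("sign:    ", entropy_bits(sign), "of 1 bit")       # ~1 bit: incompressible
print("exponent:", entropy_bits(exponent), "of 8 bits")  # only a few bits actually used
print("mantissa:", entropy_bits(mantissa), "of 7 bits")  # ~7 bits: incompressible noise
```

Summing the three numbers lands in roughly the 10-12 bits per value that the comment mentions.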

    This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu), which we used to speed up training on a large GPU cluster by about 10% in overall wall-clock time by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.), while still computing exactly the same thing as before since the compression is lossless.

    Also, rANS is more efficient than Huffman coding and easier to implement in SIMD-like instruction sets. It would reduce DFloat11's latency/throughput penalties as well (since we have to decompress before we do the arithmetic).

      • iandanforth

        yesterday at 10:05 PM

        For those who don't bother to click through profiles, Jeff really knows what he's talking about. Much of Meta/FAIR + community benefits from his code.

          • VladVladikoff

            today at 12:13 AM

            I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee jerk reactions to post titles.

        • bjornsing

          today at 4:47 AM

          > if you multiplied everything - hyperparameters, initialized weights, training data, etc in a network by 10^6 things will still work more or less the same since the upper range is hardly used (with the possible exception of some small number of special functions)

          I doubt that very much. The thing is that inputs are multiplied by weights and added together in a neural network layer, and then the output becomes the input of the next layer, in a chain that can repeat up to a hundred times or more. When you get to the final output layer, that 10^6 factor has been applied so many times that it has snowballed to a 10^600 factor.

            • ironbound

              today at 7:41 AM

              The DeepSeek-V3 paper details a quantisation method that applies scaling after the matmul but before accumulation to improve precision. This differs from a normal GEMM, where scaling is left until the end; you can read more in section 3.3 of the paper below.

              https://arxiv.org/html/2412.19437v2#S3
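
For readers who want the gist in code, here is a hedged numpy sketch of the idea as described in the comment: per-group scales are applied to each partial product before FP32 accumulation, instead of one scale at the end. The toy integer quantizer and the 128-wide group size are simplifications of my own, not DeepSeek's actual FP8 kernel.

```python
import numpy as np

K_GROUP = 128  # quantization group width along the K (reduction) dimension

def fake_quant(x, axis):
    # Toy symmetric quantizer to an integer grid with one scale per group.
    # (Real FP8 has an exponent field; this only shows where the scales are applied.)
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int32), scale.astype(np.float32)

def grouped_scaled_gemm(A, B):
    M, K = A.shape
    _, N = B.shape
    acc = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, K_GROUP):
        qa, sa = fake_quant(A[:, k0:k0 + K_GROUP], axis=1)  # per-row scale for this group
        qb, sb = fake_quant(B[k0:k0 + K_GROUP, :], axis=0)  # per-column scale for this group
        # Scale each group's partial product, then accumulate in FP32,
        # rather than scaling once after the whole low-precision GEMM.
        acc += (qa @ qb).astype(np.float32) * sa * sb
    return acc

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 512)).astype(np.float32)
B = rng.normal(size=(512, 8)).astype(np.float32)
# Quantization error, small relative to outputs (which have magnitude ~20 here).
print(np.max(np.abs(grouped_scaled_gemm(A, B) - A @ B)))
```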

          • vessenes

            today at 12:33 AM

            Thanks Jeff -- can you point me to something written up about rANS? All I find online are turbulence modeling solutions; I presume this is not what you're referring to.

            As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of any better lossless compression of BF16 weights out there?

            The reason I ask is that this DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume that's a gap in my understanding, and I'd like to understand.

              • zorgmonkey

                today at 12:43 AM

                I don't know of any great write-ups unfortunately, but the rANS you're looking for is range asymmetric numeral systems.
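
Since write-ups are scarce, here is a minimal, unoptimized rANS coder in Python just to show the mechanics: a single big-integer state and a static frequency table. Real implementations keep a small fixed-width state with renormalization, which is what makes the scheme map nicely onto SIMD/GPU lanes. This is a generic sketch, not DietGPU's or DFloat11's code.

```python
from collections import Counter

def build_model(symbols):
    freq = Counter(symbols)
    cum, total = {}, 0
    for s in sorted(freq):
        cum[s] = total          # cumulative frequency (start of s's slot range)
        total += freq[s]
    return freq, cum, total

def rans_encode(symbols, freq, cum, total):
    x = 1                        # initial state
    for s in reversed(symbols):  # rANS encodes in reverse so decoding runs forward
        x = (x // freq[s]) * total + cum[s] + (x % freq[s])
    return x

def rans_decode(x, n, freq, cum, total):
    slot_to_sym = {cum[s] + i: s for s in freq for i in range(freq[s])}
    out = []
    for _ in range(n):
        slot = x % total                             # identifies the symbol
        s = slot_to_sym[slot]
        x = freq[s] * (x // total) + slot - cum[s]   # pop the symbol off the state
        out.append(s)
    return out

msg = list(b"exponents repeat a lot, mantissas do not")
freq, cum, total = build_model(msg)
state = rans_encode(msg, freq, cum, total)
assert rans_decode(state, len(msg), freq, cum, total) == msg   # lossless round trip
print(state.bit_length(), "coded bits vs", 8 * len(msg), "raw bits")
```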

            • refibrillator

              today at 2:48 AM

              Note to others reading along: in the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty is not reported for the others).

              Using DFloat11, tokens/sec was higher only when compared relative to running inference with some layers offloaded to CPU.

              Classic comp sci tradeoff between space and speed, no free lunch, etc.

              • liuliu

                today at 5:38 AM

                That makes you think: if we could rewind time, maybe we should have just allocated one more exponent bit to half precision (6 exponent, 9 mantissa bits) instead of doing this bfloat16 thing.

                • brookst

                  today at 3:49 AM

                  Thanks for the fantastic explanation!

                  Would it be more efficient to calculate some kind of per-model or per-layer mean, and then only specify standard deviations, maybe by fp8 or smaller?

                  • hinkley

                    today at 12:57 AM

                    Do you think there’s a call for introducing an even smaller float that can pack more values into a SIMD register? Like a 12 bit?

                      • boulos

                        today at 4:03 AM

                        The latest GPUs and TPUs support fp8. It's a big part of the efficiency gain in the latest systems. Blackwell also supports fp4.

                • badmonster

                  yesterday at 7:06 PM

                  What stands out most is the practical implication: enabling lossless inference of a 405B-parameter model on a single node with 8×80GB GPUs is wild. That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

                    • latchkey

                      yesterday at 9:34 PM

                      > That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

                      Or let one of the neoclouds take care of the infrastructure costs and rent it out from them. Disclosure: I run one of them.

                        • airstrike

                          yesterday at 9:45 PM

                          Keep up the great work! We need more of you and other players.

                          Some unsolicited feedback: I would suggest reworking your landing page so that the language is always from your customers' perspective. Your customers want to solve a real internal problem that they have. Talking about how great your company is will always have less impact than talking about how you know what that problem is and how you intend to solve it.

                          Your mission is relevant to you and your investors, not to your customers. They care about themselves.

                          Your "quick start" should be an interactive form. I shouldn't have to remember what to put in an email to reach out to you. Make it easy for me. Also move that to the front page, provide a few "standard" packages and a custom one. Reduce the friction to clicking the CTA.

                          Since your pricing is transparent, you should be able to tell me what that price will be before I even submit a request. I assume you're cheaper than the competition (otherwise why would I not go with them?) so make that obvious. Check out Backblaze's website for an example page: https://www.backblaze.com/cloud-storage/pricing

                          Shell out a few grand and hire a designer to make your page look more professional. Something like https://oxide.computer/ but with the points above, as they also make the same mistake of making their home page read like a pitch deck.

                            • latchkey

                              yesterday at 10:11 PM

                              Fantastic unsolicited feedback, I'm definitely taking this to heart!

                              The website is intended to be more like documentation, not a pitch deck or a useless splash page with a contact-us form. I dislike sites like Oxide; I scroll past and don't read or ingest any of the fancy parts. Of course, you're right, this probably needs to be less about me. =)

                              Friction definitely needs to be improved. That part is being worked on right now. Our intention is to be fully self-service, so that you don't have to talk to us at all, unless you want to. Credit card and go.

                              We recently lowered our prices to be competitive with the rest of the market vs. focusing on people who care more about what we offer. We weren't trying to be cheaper than everyone else, we were trying to offer a better service. Lesson learned and pricing adjusted. Streisand effect, I don't like to mention the other players much.

                              Again, thanks!

                          • sundarurfriend

                            yesterday at 11:45 PM

                            > neoclouds

                            For anyone else who hadn't heard of this term:

                            > Neoclouds are startups specializing in AI-specific cloud computing. Unlike their larger competitors, they don’t develop proprietary chips. Instead, they rely heavily on Nvidia’s cutting-edge GPUs to power their operations. By focusing solely on AI workloads, these companies offer specialized solutions tailored to AI developers’ needs.

                            from https://www.tlciscreative.com/the-rise-of-neoclouds-shaping-...

                          • Ringz

                            today at 12:51 AM

                            I need your services in Cape Town South Africa. It’s hard to find good data centers here.

                              • latchkey

                                today at 1:30 AM

                                Rent from us! hello@hotaisle.ai

                            • saagarjha

                              today at 1:11 AM

                              That just moves the infrastructure costs to your cloud bill.

                                • latchkey

                                  today at 1:29 AM

                                  True, but there is so much value that we provide above and beyond just a cloud bill that I think it is worth it. This is way more than racking and stacking commodity servers and providing an SSH login.

                                  It is novel equipment that few have ever used before outside of a relatively small HPC community. It regularly breaks and has issues (bugs) that need industry relationships to manage properly. We've had one server down for over a month now because SMCI can't get their sh/t together to fix it. That's a $250k+, 350 lb paperweight. Good luck to any other small company that wants to negotiate that relationship.

                                  We are offering a very valuable service by enabling easy access to some of the most powerful compute available today. How many people do you think have a good grasp of what it takes to configure RoCEv2 and 8x400G across a cluster of servers? Good luck trying to hire talent that can set that up; they already have jobs.

                                  The capex/opex/complexity involved with deploying this level of gear is huge and only getting larger as the industry shifts to bigger/better/faster (i.e., air cooling is dead). Things are moving so quickly that equipment you purchased a year ago is already out of date (H100 -> H200 is a great example). You're going to have to have a pretty impressive depreciation model to deploy this yourself.

                                  I wouldn't just dismiss this as moving costs around.

                                    • zarathustreal

                                      today at 11:30 AM

                                      wait your competitive advantage is “human friction exists”?

                                      …how do you justify marketing yourself in a system like that?

                                      “In general, people in this vertical have difficulty doing their jobs. Luckily we’ve had drinks with most of them” ……

                          • miohtama

                            yesterday at 9:08 PM

                            I am not an expert here, so I want to ask: what's magical about the 405B number?

                              • daveguy

                                yesterday at 9:18 PM

                                That's the size of the largest, most capable, open source models. Specifically Llama 3.1 has 405B parameters. Deepseek's largest model is 671B parameters.

                                  • mhitza

                                    yesterday at 9:42 PM

                                     Small correction: Llama 3.1 is not an open-source model, but a Llama-3.1-licensed model. Apparently neither is DeepSeek (https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/LIC...), which I had wrongly assumed was open source. Though I never considered using it, so I hadn't checked the license before.

                                      • gunalx

                                        yesterday at 10:58 PM

                                        Both DeepSeek R1 and V3-0324 are MIT licensed.

                                        • Der_Einzige

                                          today at 2:24 AM

                                          You can just ignore the license since the existence of these models is based on piracy at a scale never before seen. Aaron Swartz couldn’t have even imagined violating copyright that hard.

                                          If you live in a glass house, you won't throw stones. No one in the LLM space wants to be litigious.

                                          It's an open secret that DeepSeek used a ton of OpenAI continuations both in pre-training and in distillation. That totally violates OpenAI's TOS. No one cares.

                                            • LoganDark

                                              today at 3:34 AM

                                              > No one in the LLM space wants to be litigious

                                              Except for OpenAI.

                              • Der_Einzige

                                today at 2:22 AM

                                4-bit quants of DeepSeek or Llama 3 405B already fit on those GPUs and purportedly have almost zero loss compared to the full model. Doesn't seem like that big of a deal given this.

                                • danielmarkbruce

                                  yesterday at 7:44 PM

                                  It's... useful right now... it's not a huge unlock in a world where model size, GPU memory size, and precision support are all changing quickly.

                                    • jhj

                                      yesterday at 11:33 PM

                                      Unlike quantization, dimensionality reduction/low-rank approximation, distillation, etc., lossless compression is an always-correct addition to any ML system, as you are computing the same thing you did before; the only questions are whether it is fast enough to avoid substantial bottlenecks and whether the achievable compression ratio is high enough to be useful.

                                      Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so it will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes go against each other.

                                      If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.

                                      • striking

                                        yesterday at 7:51 PM

                                        Is GPU memory size really changing that quickly? For that matter, is model size?

                                          • kadushka

                                            yesterday at 8:07 PM

                                            What's rapidly changing are quantization algorithms, and hardware features to support those algorithms. For example, Blackwell GPUs support dynamic FP4 quantization with group size 16. At that group size it's close to lossless (in terms of accuracy metrics).
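
As a toy illustration of group-wise FP4-style quantization with group size 16 (lossy, unlike DFloat11): the E2M1-like value grid below is the commonly cited FP4 set, but the float scale per group is my own simplification; real hardware formats also constrain how the scales themselves are stored.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)  # E2M1 magnitudes
GROUP = 16

def fp4_quant_dequant(x):
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1] + 1e-12  # one scale per 16 values
    mag = np.abs(g) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)  # snap to nearest representable value
    return (np.sign(g) * FP4_GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
w_hat = fp4_quant_dequant(w)
print("mean relative error:", float(np.abs(w - w_hat).mean() / np.abs(w).mean()))  # small but nonzero
```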

                                            • latchkey

                                              yesterday at 9:25 PM

                                              Both AMD and Nvidia are dumping more and more memory into their GPUs.

                                              MI300X is 192GB HBM3, MI325X is 256GB HBM3e, and MI355X should be 288GB HBM3e (and support FP4/FP6).

                                                • NBJack

                                                  yesterday at 9:58 PM

                                                  On the professional side of things, yes. For consumer-grade GPUs, despite gaming-market trends that would otherwise call for it, the values have stagnated a bit.

                                                    • latchkey

                                                      yesterday at 10:13 PM

                                                      I'm under NDA with AMD and sadly can't mention details, but I can say the future is promising.

                                                        • DrillShopper

                                                          yesterday at 11:04 PM

                                                          I hope AMD cracks the CUDA Problem soon

                                            • danielmarkbruce

                                              yesterday at 8:28 PM

                                              Yes, yes.

                                              Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.

                                              DeepSeek recently released a ~670GB model. A couple of years ago, Falcon's 180GB seemed huge.

                                                • spoaceman7777

                                                  yesterday at 9:19 PM

                                                   I'd assume that, in the context of LLM inference, "recent" generally refers to the Ampere generation and later of GPUs, when the demand for onboard memory went through the roof (as the first truly usable LLMs were trained on A100s).

                                                  We've been stuck with the same general caps on standard GPU memory since then though. Perhaps limited in part because of the generational upgrades happening in the bandwidth of the memory, rather than the capacity.

                                                    • danielmarkbruce

                                                      yesterday at 9:26 PM

                                                      Bandwidth is going up too. "It's not doubling every 18 months and hence it's not moving" isn't a sensible way to view change.

                                                      A one time effective 30% reduction in model size simply isn't going to be some massive unlocker, in theory or in practice.

                                  • loufe

                                    yesterday at 6:46 PM

                                    I'm so grateful to live through such exciting times. I can open HN every two to some exciting new news about ML/transformer models. I really should read more into it, but does llama.cpp use a "custom kernel" per se, with cuBLAS, or is it just making good use of the cuBLAS kernel?

                                      • jonplackett

                                        yesterday at 7:15 PM

                                        It’s funny that you’re missing the time frame from your sentence.

                                        2 weeks? Two months? Two days? Two minutes?

                                        All of the above are true sometimes! Exciting times indeed.

                                          • loufe

                                            today at 5:21 AM

                                            Good catch, I meant every two days! :)

                                    • Animats

                                      yesterday at 9:35 PM

                                      Once this weight format war settles down, hardware can be built to support it. Presumably you want matrix multiply hardware optimized for whatever weight format turns out to be reasonably optimal.

                                        • eoerl

                                          yesterday at 10:53 PM

                                          Optimization is post hoc here: you have to train first to be able to Huffman encode, so it's not a pure format question.

                                      • aseligman

                                        yesterday at 10:06 PM

                                        Some additional context: many real-world agent use cases struggle to balance quality, cost, and performance. This technique can help avoid the tradeoffs that quantization techniques introduce, including unpredictable results while you try to cost-optimize an agent. In some cases the cost savings can be significant using DFloat11, as you squeeze into more affordable GPUs.

                                        * I work with xmad.ai

                                        • yjftsjthsd-h

                                          yesterday at 7:57 PM

                                          > Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.

                                          The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious if it improves tokens/sec even all on GPU, since in my very amateur understanding LLMs tend to be constrained by memory bandwidth?

                                            • brigade

                                              today at 6:18 AM

                                              It does not; the decompression is memory to memory, one tensor at a time, so it’s worse. They claim less than 200 GB/s on an A100, and their benchmarks suggest it’s somewhere between 1.5-4x slower at batch size 1 depending on GPU and model. This overhead of course mostly disappears with a large enough batch size.

                                              Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement. But the A100's raw memory bandwidth is 1.6 TB/s.

                                              • philjohn

                                                yesterday at 8:20 PM

                                                My mental model is saying it might do, much like on slow hard drives DoubleSpace in DOS slightly sped up loading data from disk.

                                                • hnuser123456

                                                  yesterday at 8:58 PM

                                                  If the model is 70% the size, it will be 1/0.7 = 1.43x the speed.

                                              • gitroom

                                                today at 3:57 AM

                                                Pretty cool seeing how fast all this moves - feels like every week there's a new trick or hardware upgrade. I def get nerd sniped by these efficiency improvements lol.

                                                • thund

                                                  yesterday at 11:19 PM

                                                  Is this different than ZipNN? https://arxiv.org/pdf/2411.05239

                                                  I see it mentioned but can’t understand if it’s based on it or different/better…

                                                    • thund

                                                      yesterday at 11:24 PM

                                                      Found it, the news reminded me of this paper https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7...

                                                      • jhj

                                                        yesterday at 11:28 PM

                                                        Not really, it's just adding some data transposition (coalescing individual bytes from the data words together) and an option to use an LZ/dictionary-type compressor to compress redundant things. But an LZ-type compressor doesn't make much sense on NN weights, I think, since they are not as redundant as most text data with many repeats, and the space of possible dictionary matches is pretty small: unless the data is highly sparse, there may not be many repetitions that you can leverage to offset the dictionary overhead.

                                                        If you add an LZ-type compressor and have this be in the critical path for inference, then decompression will be a lot slower. It would be best to fuse decompression with the compute kernels (e.g., a GEMM that performs decompression on each tile before the arithmetic), and the simpler the decompression routine, the easier this will be.
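
A structural sketch of the "fuse decompression with the GEMM" idea, in host-side Python rather than an actual GPU kernel: weights are stored as independently compressed tiles, and each tile is decompressed immediately before its partial product is accumulated, so there is never a full-size decompressed copy of the matrix. zlib stands in for whatever entropy coder a real kernel would use; the tile size and helper names are mine.

```python
import numpy as np
import zlib

TILE = 128

def compress_tiles(W):
    # Store the weight matrix as a dict of independently compressed TILE x TILE blocks.
    K, N = W.shape
    tiles = {(k0, n0): zlib.compress(W[k0:k0 + TILE, n0:n0 + TILE].tobytes())
             for k0 in range(0, K, TILE) for n0 in range(0, N, TILE)}
    return tiles, W.shape, W.dtype

def matmul_compressed(X, tiles, shape, dtype):
    K, N = shape
    out = np.zeros((X.shape[0], N), dtype=np.float32)
    for (k0, n0), blob in tiles.items():
        # Decompress one tile, use it, and let it go.
        tile = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(TILE, TILE)
        out[:, n0:n0 + TILE] += X[:, k0:k0 + TILE] @ tile
    return out

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(512, 256)).astype(np.float32)
X = rng.normal(size=(4, 512)).astype(np.float32)
print(np.allclose(matmul_compressed(X, *compress_tiles(W)), X @ W))  # True: decompression is exact
```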

                                                    • wills_forward

                                                      yesterday at 7:01 PM

                                                      So this could universally decrease the memory requirements of un-quantized LLMs by 30%? Seems big if true.

                                                        • moffkalast

                                                          yesterday at 7:28 PM

                                                          Not as big when Q8 quantization is already considered overkill and cuts it down to 50% (and a flat 2x speed boost without any additional compute overhead, mind you), and the more common Q4KM is more like 30%. Definitely interesting if it can be added to existing quantization, but K-quants already use different precision levels for different layers depending on general perplexity impact, which is similar to the entropy metric they use (e.g., Q6 using a mix of 4 bits and 8 bits). And that's not even considering the calibrated imatrix, which does something conceptually similar to FFT to compress even higher.

                                                            • janalsncm

                                                              yesterday at 7:36 PM

                                                              Quantization is not lossless.

                                                                • danielmarkbruce

                                                                  yesterday at 7:39 PM

                                                                  Nobody really cares if it meets a strict definition of lossless.

                                                                    • moffkalast

                                                                      yesterday at 7:53 PM

                                                                      And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.

                                                                      In fact, there's this funny occurrence where Q4 models on occasion perform better than their fp16 counterparts on benchmarks run with top_k=1, since the outputs are slightly more random and they can less deterministically blunder past the local maximum into a more correct solution.

                                                                        • Der_Einzige

                                                                          today at 2:28 AM

                                                                          We got an oral at ICLR for calling out how shit samplers like top_p and top_k are. Use min_p!
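
For reference, min_p sampling keeps every token whose probability is at least min_p times the top token's probability, then renormalizes and samples. A minimal numpy sketch follows; the 0.1 threshold is just an illustrative default, not a recommendation from the thread.

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, rng=np.random.default_rng()):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()   # threshold scales with how confident the model is
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    return int(rng.choice(len(logits), p=filtered))

logits = np.array([2.0, 1.5, 0.2, -3.0, -5.0])
print([min_p_sample(logits) for _ in range(10)])  # only the plausible tokens ever get picked
```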

                                                                            • moffkalast

                                                                              today at 7:13 AM

                                                                              True yep, I wish more people benchmarked models with more representative sampler settings and then took the average of 5 or 10 responses.

                                                                      • BoorishBears

                                                                        yesterday at 8:12 PM

                                                                        I do? I spend a ton of time post-training models for creative tasks.

                                                                        The effects of model quantization are usually qualified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or even worse they'll be measured on metrics that don't map to anything except themselves like perplexity (https://arxiv.org/pdf/2407.09141)

                                                                        I agree Q8 is strong but I also think the effects of quantization are constantly being underappreciated. People are often talking about how these models perform while fundamentally using 10+ variants of a single model with distinct performance profiles.

                                                                        Even knowing the bits per weight used isn't enough to know how exactly a given quant method is affecting the model: https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs

                                                                          • imtringued

                                                                            today at 9:35 AM

                                                                            If you'd trained your own models, you would be aware of quantization-aware training.

                                                                            • danielmarkbruce

                                                                              yesterday at 8:37 PM

                                                                              "Nobody really cares if it meets a strict definition of lossless" != "quantization can be done haphazardly."

                                                                                • BoorishBears

                                                                                  yesterday at 8:54 PM

                                                                                  If you're trying to really snarkily refer to the article on Dynamic Quants 2.0 and how carefully developed they were: they're comparing their quants to the methodology that 99.99% of quants out there use.

                                                                                  The problem is not that people are making quants "haphazardly", it's that people keep parroting that various quants are "practically lossless" when they actually have absolutely no clue how lossy they are given how application specific the concept is for something as multidimensional as an LLM.

                                                                                  The moment anyone tries a little harder to quantify how lossy they are, we repeatedly find that the answer is "not by any reasonable definition of lossless". Even their example where Q4 is <1% away on MMLU 5-shot is probably massively helped by a calibration dataset that maps to MMLU-style tasks really well, just like constantly using WikiText massively helps models that were trained on... tons of text from Wikipedia.

                                                                                  So unless you're doing your own calibrated quantization with your own dataset (which is not impossible, but also not near common), even their "non-haphazard" method could have a noticeable impact on performance.

                                                                                    • danielmarkbruce

                                                                                      yesterday at 9:22 PM

                                                                                      Wasn't referring to that.

                                                                                      You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant it's not the exact same thing as making them haphazardly, but I think you took the point.

                                                                                      The terms shouldn't be used here. They aren't helpful. You are either getting good results or you are not. It shouldn't be treated differently from further training on dataset d. The weights changed - how much better or worse at task Y did it just get?

                                                                                        • BoorishBears

                                                                                          yesterday at 9:49 PM

                                                                                          The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:

                                                                                          - quality for your specific application

                                                                                          - time to first token

                                                                                          - inter-token latency

                                                                                          - memory usage (varies even for a given bits per weight)

                                                                                          - generation of hardware required to run

                                                                                          Of those the hardest to measure is consistently "quality for your specific application".

                                                                                          It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...

                                                                                          When people are paying multiples more for compute to side-step a problem, language and technology that allows you to erase it from the equation is valid.

                                                                                            • danielmarkbruce

                                                                                              yesterday at 9:58 PM

                                                                                              You say that as though people know these things for the full precision deployment and their use case.

                                                                                        Some have the capability to figure it out and can do it for both full precision and quantized. Most don't and cannot.

                                                                          • kridsdale3

                                                                            yesterday at 7:57 PM

                                                                            That's not true. If there are measurable performance differences.

                                                                              • danielmarkbruce

                                                                                yesterday at 8:33 PM

                                                                                "strict" means something. People, including yourself, only care if there is a practical difference in performance. "this is lossless and that isn't lossless" is a completely useless statement in this realm. In many domains lossy compression is either not tolerated, not legal or not practical.

                                                                                • kadushka

                                                                                  yesterday at 8:09 PM

                                                                                  If you get any accuracy degradation with full 8 bits of precision you're doing it wrong.

                                                                                    • omneity

                                                                                      yesterday at 9:27 PM

                                                                                      Or your model wasn't trained so well (weights are too spiky)

                                                                              • throwaway314155

                                                                                yesterday at 8:09 PM

                                                                                Seems reductive.

                                                                • firefoxd

                                                                  today at 3:35 AM

                                                                  Someone has figured out how to compress images even further with LLMs. They've been promising to publish a white paper since last year: https://getproxyai.com/blog/this-image-is-4KB

                                                                  /s I'll show myself out

                                                                  • jsemrau

                                                                    today at 1:03 AM

                                                                    I still hold the opinion that ternary instead of binary would lead to an even higher degree of compression.

                                                                      • xmasotto

                                                                        today at 1:42 AM

                                                                        The underlying memory is still binary, or were you proposing an entirely new computer architecture with ternary gates?

                                                                  • mountainriver

                                                                    yesterday at 7:32 PM

                                                                    Is it possible to run this on new models? It seems like the code is only for inference, unless I'm misunderstanding.

                                                                    • luotuoshangdui

                                                                      yesterday at 7:53 PM

                                                                      Does it affect speed?

                                                                      • aazo11

                                                                        yesterday at 9:58 PM

                                                                        This is a huge unlock for on-device inference. The download time of larger models makes local inference unusable for non-technical users.

                                                                        • marksimi

                                                                          yesterday at 7:42 PM

                                                                          Time to (dynamically) float

                                                                          • iamnotagenius

                                                                            yesterday at 6:57 PM

                                                                            Interesting, but not exactly practical for a local LLM user, as 4-bit is how LLMs are run locally.

                                                                              • sroussey

                                                                                yesterday at 7:08 PM

                                                                                True, but their research did include running locally on a 5080.

                                                                                The big takeaway, in my opinion, is that their technique for LUTs etc. could also be applied to lossy quants. Say, maybe you get 5-bit accuracy in the size of 4-bit?

                                                                                I don't know, but maybe? Also, their two-stage design might make current quantized kernel designs better.

                                                                                  • spindump8930

                                                                                    yesterday at 8:13 PM

                                                                                    Yes, it could be stacked on quants. It might be that quantized activations already are more "dense" and so they can't be compressed as much (from 16 -> ~11 bits), but certainly possible.

                                                                                      • jasonjmcghee

                                                                                        yesterday at 10:15 PM

                                                                                        I read it similarly - that this is a specific attribute of bfloat16, so the quants folks tend to run on local hardware don't have the same inefficiency to exploit

                                                                                • gojomo

                                                                                  yesterday at 7:28 PM

                                                                                  Some might prefer the fidelity of this method's ~30% savings (weights at ~70% of their original size) over the lossiness of 4-bit quantization's 75% savings.

                                                                                  And, maybe the methods stack for those willing to trade both costs for the smallest representation.

                                                                                    • svachalek

                                                                                      yesterday at 7:39 PM

                                                                                      This is only a 30% savings, which is a cool technical feat but hard to see a use case for.

                                                                                    • ein0p

                                                                                      yesterday at 7:14 PM

                                                                                      Note that this is _way_ slower at the small batch sizes you'd need for interactive use. At batch size 1 this seems to run at 1/3rd the speed of bf16 (so about 1/6th the speed of the fp8 you'd realistically be using), if figure 5 is to be believed. This is actually a pretty impressive feat in itself if you know anything about GPU kernel programming, but it is much slower nevertheless. For this to work at "wire speed" it'd need hardware support, which takes years. Their "baseline" elsewhere in the paper is CPU offloading, which is dog slow and can't be made fast due to the PCIe bottleneck.

                                                                                        • timschmidt

                                                                                          yesterday at 7:27 PM

                                                                                          It's perfectly possible to run LLMs quickly on CPUs. An Epyc or Xeon with 12 memory channels achieves similar memory bandwidth to a 4090, which is the limiting factor. Engineering sample Epycs in kits with motherboard and RAM are available on Aliexpress for reasonable prices even.

                                                                                            • ein0p

                                                                                              yesterday at 7:47 PM

                                                                                              Did I say it wasn't? If your context is short and your model is small, it is possible to run LLMs on high-end CPUs able to support 12 channels of high-spec DDR5 RDIMMs. It's not possible to run them as fast as they'd run on a GPU equipped with HBM, though, nor would it be even remotely as energy efficient. It's also not possible to run LLMs quickly on a CPU if your context is long, because CPUs do not have the requisite FLOPS to process long context quickly.

                                                                                              And before you bring MoE into the conversation: MoE only affects the feedforward part of each transformer block, and the full memory bandwidth and compute savings are only realized at batch size 1, sequence length 1, AKA the most inefficient mode that nobody other than Ollama users uses in practice. Sequence length 8 (common for speculative decoding) could be using up to 8x37B parameters (assuming you want to run DeepSeek, the strongest available open-weights model). A batch size of even 2 with sequence length 8 could use almost all parameters if you're particularly unlucky. Prompt processing will almost certainly use all parameters, and will slam into the FLOPS wall of your EPYC's ALUs.

                                                                                              So can LLMs (with an emphasis on "Large") be run on CPUs? Yes. Are you going to have a good time running them this way? No.

                                                                                                • timschmidt

                                                                                                  yesterday at 9:59 PM

                                                                                                  llamafile contains specific optimizations for prompt processing using AVX512 for dealing with just this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp)

                                                                                                  Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And we've managed to reinvent Intel's Larrabee / Knights concept.

                                                                                                  Sadly, the highly optimized AVX512 kernels of llamafile don't support these exotic floats yet as far as I know.

                                                                                                  Yes, energy efficiency per query will be terrible compared to a hyperscaler. However privacy will be perfect. Flexibility will be higher than other options - as running on the CPU is almost always possible. Even with new algorithms and experimental models.

                                                                                                    • ein0p

                                                                                                      yesterday at 10:15 PM

                                                                                                      At 192 cores you're way better off buying a Mac Studio, though.

                                                                                          • ow5

                                                                                            yesterday at 8:39 PM

                                                                                            Hi! I'm one of the contributors to the paper. We have kernels, not yet released, that can shave decoding latency by >20%.

                                                                                            Also, when we ran streaming experiments with the current kernels, we were a median of ~1.3x slower at inference.

                                                                                              • ein0p

                                                                                                yesterday at 9:34 PM

                                                                                                Thanks for chiming in! How do you explain the top-most graph in Figure 5? Am I misreading it?

                                                                                        • hchja

                                                                                          yesterday at 7:53 PM

                                                                                          This is pretty useless in any case that doesn’t involve BFloat16 models

                                                                                            • spindump8930

                                                                                              yesterday at 8:12 PM

                                                                                              bf16 is the de facto default datatype and distribution format for LLMs, which are then often eagerly quantized by users with more limited hardware. See the recent Llama releases and e.g. the H100 spec sheet (advertised flops and metrics target bf16).

                                                                                              • throwaway314155

                                                                                                yesterday at 8:08 PM

                                                                                                So an increasingly smaller number of cases?

                                                                                            • anticensor

                                                                                              yesterday at 8:45 PM

                                                                                              This is just a VBR mode for neural networks. Not quite useful when inference is already quite slow.

                                                                                                • vessenes

                                                                                                  today at 12:36 AM

                                                                                                  Even presuming this is an accurate summary, the conclusion is not accurate - most local LLM inference users are constantly trading off quality for speed, in that speed drops dramatically once RAM is full. So, if you think of speed at desired quality, this could be very useful.

                                                                                              • Havoc

                                                                                                yesterday at 6:58 PM

                                                                                                I'm guessing by lossless they mean something other than what the word usually means in a compression context?

                                                                                                >achieving near information-optimal compression without any loss of precision

                                                                                                So perhaps more lossless as in didn't lose perplexity/benchmarks?

                                                                                                In my mind lossless is precisely zero bits lost along the way.

                                                                                                  • artemisart

                                                                                                    yesterday at 7:12 PM

                                                                                                    The first sentence of the introduction ends with "we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model" so yes it's lossless.

                                                                                                    • Vendan

                                                                                                      yesterday at 7:03 PM

                                                                                                      information-optimal compression is "the theoretical minimum number of bits needed to represent data without losing any information, based on the data's entropy", so I think they mean the same thing you do

                                                                                                        • brokencode

                                                                                                          yesterday at 7:22 PM

                                                                                                          Yeah, they’re saying that this compression is almost as good as is theoretically possible without losing any information.

                                                                                                      • vintermann

                                                                                                        yesterday at 7:36 PM

                                                                                                        A good example that information, i.e. bits, is only meaningful with respect to an end. If you don't know what the bits in a float will be used for, you can't throw them away; but if the floats are in a function, and you know that certain bits can't affect the output of the function regardless of input, then you can throw those bits away and still have a lossless compression of the function.

                                                                                                        • 8ytecoder

                                                                                                          yesterday at 7:10 PM

                                                                                                          Think Morse code, where frequently used letters have shorter codes than less frequent ones. This ensures zero loss of information.
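
To make the analogy concrete, here is a tiny Huffman coder: frequent symbols get short codes, rare ones get long codes, and decoding reproduces the input exactly. DFloat11 applies this kind of prefix coding to the highly skewed bfloat16 exponents; the toy data below is my own stand-in.

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(data):
    tie = itertools.count()  # unique tie-breaker so the heap never compares the code dicts
    heap = [(n, next(tie), {s: ""}) for s, n in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees...
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))  # ...merged and pushed back
    return heap[0][2]

def encode(data, codes):
    return "".join(codes[s] for s in data)

def decode(bits, codes):
    inverse, out, cur = {v: k for k, v in codes.items()}, [], ""
    for b in bits:
        cur += b
        if cur in inverse:       # prefix-free codes: the first match is the symbol
            out.append(inverse[cur])
            cur = ""
    return out

data = list("aaaaaaaabbbbccd")   # skewed frequencies, like bfloat16 exponent bytes
codes = huffman_codes(data)
bits = encode(data, codes)
assert decode(bits, codes) == data              # bit-for-bit identical round trip
print(codes, len(bits), "coded bits vs", 8 * len(data), "raw bits")
```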

                                                                                                          • ziddoap

                                                                                                            yesterday at 7:28 PM

                                                                                                            The part you quote is a few sentences past the sentence that says "preserving outputs that are bit-for-bit identical to the original model".