Lossless LLM compression for efficient GPU inference via dynamic-length float

371 points - yesterday at 6:20 PM

Source
  • jhj

    yesterday at 8:45 PM

    This is just a consequence of the fact that bfloat16 has a very high dynamic range which is not all used. People like hyperparameters that look like 0.01 not 10^10, even though there is the same fractional precision available at each exponent and if you multiplied everything - hyperparameters, initialized weights, training data, etc in a network by 10^6 things will still work more or less the same since the upper range is hardly used (with the possible exception of some small number of special functions).

    Typical entropy of bfloat16 values seen in weights (and activations) is about 10-12 bits (only 65-75% or so of the 16 bits carry information in practice). Sign and mantissa bits tend to be incompressible noise.
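
As a rough illustration of the entropy claim above, here is a small Python sketch that splits synthetic bfloat16 weights into sign/exponent/mantissa fields and measures the empirical entropy of each. The normal(0, 0.02) weight distribution and all names are assumptions for the example, not something from the thread.

```python
import numpy as np

# Synthetic "weights": trained weights are typically roughly zero-mean with a small std
# (the 0.02 here is just an illustrative assumption).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

# bfloat16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits.
bits = w.view(np.uint32) >> 16
sign     = (bits >> 15) & 0x1
exponent = (bits >> 7) & 0xFF
mantissa = bits & 0x7F

def entropy_bits(symbols):
    # Empirical Shannon entropy in bits per symbol.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print("sign:    ", entropy_bits(sign), "of 1 bit")       # ~1 bit: incompressible
print("exponent:", entropy_bits(exponent), "of 8 bits")  # only a few bits actually used
print("mantissa:", entropy_bits(mantissa), "of 7 bits")  # ~7 bits: incompressible noise
```

Summing the three numbers lands in roughly the 10-12 bits per value that the comment mentions.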

    This has been exploited several times before in the context of both classical HPC and AI, with lossless compression work from Martin Burtscher's lab (https://userweb.cs.txstate.edu/~burtscher/), fpzip from LLNL (https://computing.llnl.gov/projects/fpzip), and my library dietgpu from 2021 (https://github.com/facebookresearch/dietgpu), which we used to speed up training on a large GPU cluster by about 10% in overall wall-clock time by losslessly compressing all data prior to send and decompressing upon receive (e.g., gradients, weights from backup, etc.), while still computing exactly the same thing as before since the compression is lossless.

    Also, rANS is more efficient than Huffman coding and easier to implement in SIMD-like instruction sets. It would reduce DFloat11's latency/throughput penalties as well (since we have to decompress before we do the arithmetic).

      • iandanforth

        yesterday at 10:05 PM

        For those who don't bother to click through profiles, Jeff really knows what he's talking about. Much of Meta/FAIR + community benefits from his code.

          • VladVladikoff

            today at 12:13 AM

            I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee jerk reactions to post titles.

        • bjornsing

          today at 4:47 AM

          > if you multiplied everything - hyperparameters, initialized weights, training data, etc in a network by 10^6 things will still work more or less the same since the upper range is hardly used (with the possible exception of some small number of special functions)

          I doubt that very much. The thing is that inputs are multiplied by weights and added together in a neural network layer, and then the output becomes the input of the next layer, in a chain that can repeat up to a hundred times or more. When you get to the final output layer, that 10^6 factor has been applied so many times that it has snowballed to a 10^600 factor.

            • ironbound

              today at 7:41 AM

              The DeepSeek-V3 paper details a quantisation method that applies scaling after the matmul but before accumulation to improve precision. This differs from a normal GEMM, where scaling is left until the end; you can read more in section 3.3 of the paper below.

              https://arxiv.org/html/2412.19437v2#S3
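
For readers who want the gist in code, here is a hedged numpy sketch of the idea as described in the comment: per-group scales are applied to each partial product before FP32 accumulation, instead of one scale at the end. The toy integer quantizer and the 128-wide group size are simplifications of my own, not DeepSeek's actual FP8 kernel.

```python
import numpy as np

K_GROUP = 128  # quantization group width along the K (reduction) dimension

def fake_quant(x, axis):
    # Toy symmetric quantizer to an integer grid with one scale per group.
    # (Real FP8 has an exponent field; this only shows where the scales are applied.)
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int32), scale.astype(np.float32)

def grouped_scaled_gemm(A, B):
    M, K = A.shape
    _, N = B.shape
    acc = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, K_GROUP):
        qa, sa = fake_quant(A[:, k0:k0 + K_GROUP], axis=1)  # per-row scale for this group
        qb, sb = fake_quant(B[k0:k0 + K_GROUP, :], axis=0)  # per-column scale for this group
        # Scale each group's partial product, then accumulate in FP32,
        # rather than scaling once after the whole low-precision GEMM.
        acc += (qa @ qb).astype(np.float32) * sa * sb
    return acc

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 512)).astype(np.float32)
B = rng.normal(size=(512, 8)).astype(np.float32)
# Quantization error, small relative to outputs (which have magnitude ~20 here).
print(np.max(np.abs(grouped_scaled_gemm(A, B) - A @ B)))
```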

          • vessenes

            today at 12:33 AM

            Thanks Jeff -- can you point me to something written up about rANS? All I find online are turbulence modeling solutions; I presume this is not what you're referring to.

            As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of any better lossless compression of BF16 weights out there?

            The reason I ask is that this DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume that's a gap in my understanding, and I'd like to understand.

              • zorgmonkey

                today at 12:43 AM

                I don't know of any great write-ups unfortunately, but the rANS you're looking for is range asymmetric numeral systems.
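
Since write-ups are scarce, here is a minimal, unoptimized rANS coder in Python just to show the mechanics: a single big-integer state and a static frequency table. Real implementations keep a small fixed-width state with renormalization, which is what makes the scheme map nicely onto SIMD/GPU lanes. This is a generic sketch, not DietGPU's or DFloat11's code.

```python
from collections import Counter

def build_model(symbols):
    freq = Counter(symbols)
    cum, total = {}, 0
    for s in sorted(freq):
        cum[s] = total          # cumulative frequency (start of s's slot range)
        total += freq[s]
    return freq, cum, total

def rans_encode(symbols, freq, cum, total):
    x = 1                        # initial state
    for s in reversed(symbols):  # rANS encodes in reverse so decoding runs forward
        x = (x // freq[s]) * total + cum[s] + (x % freq[s])
    return x

def rans_decode(x, n, freq, cum, total):
    slot_to_sym = {cum[s] + i: s for s in freq for i in range(freq[s])}
    out = []
    for _ in range(n):
        slot = x % total                             # identifies the symbol
        s = slot_to_sym[slot]
        x = freq[s] * (x // total) + slot - cum[s]   # pop the symbol off the state
        out.append(s)
    return out

msg = list(b"exponents repeat a lot, mantissas do not")
freq, cum, total = build_model(msg)
state = rans_encode(msg, freq, cum, total)
assert rans_decode(state, len(msg), freq, cum, total) == msg   # lossless round trip
print(state.bit_length(), "coded bits vs", 8 * len(msg), "raw bits")
```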

            • refibrillator

              today at 2:48 AM

              Note to others reading along: in the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty is not reported for the others).

              Using DFloat11, tokens/sec was higher only when compared relative to running inference with some layers offloaded to CPU.

              Classic comp sci tradeoff between space and speed, no free lunch, etc.

              • liuliu

                today at 5:38 AM

                That makes you think: if we could rewind time, maybe we should have just allocated one more exponent bit to half precision (6 exponent, 9 mantissa bits) instead of doing this bfloat16 thing.

                • brookst

                  today at 3:49 AM

                  Thanks for the fantastic explanation!

                  Would it be more efficient to calculate some kind of per-model or per-layer mean, and then only specify standard deviations, maybe by fp8 or smaller?

                  • hinkley

                    today at 12:57 AM

                    Do you think there’s a call for introducing an even smaller float that can pack more values into a SIMD register? Like a 12 bit?

                      • boulos

                        today at 4:03 AM

                        The latest GPUs and TPUs support fp8. It's a big part of the efficiency gain in the latest systems. Blackwell also supports fp4.

                • badmonster

                  yesterday at 7:06 PM

                  What stands out most is the practical implication: enabling lossless inference of a 405B-parameter model on a single node with 8×80GB GPUs is wild. That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

                    • latchkey

                      yesterday at 9:34 PM

                      > That’s a huge unlock for research labs and startups alike that want to run frontier models without massive infrastructure costs.

                      Or let one of the neoclouds take care of the infrastructure costs and rent it out from them. Disclosure: I run one of them.

                        • airstrike

                          yesterday at 9:45 PM

                          Keep up the great work! We need more of you and other players.

                          Some unsolicited feedback: I would suggest reworking your landing page so that the language is always from your customers' perspective. Your customers want to solve a real internal problem that they have. Talking about how great your company is will always have less impact than talking about how you know what that problem is and how you intend to solve it.

                          Your mission is relevant to you and your investors, not to your customers. They care about themselves.

                          Your "quick start" should be an interactive form. I shouldn't have to remember what to put in an email to reach out to you. Make it easy for me. Also move that to the front page, provide a few "standard" packages and a custom one. Reduce the friction to clicking the CTA.

                          Since your pricing is transparent, you should be able to tell me what that price will be before I even submit a request. I assume you're cheaper than the competition (otherwise why would I not go with them?) so make that obvious. Check out Backblaze's website for an example page: https://www.backblaze.com/cloud-storage/pricing

                          Shell out a few grand and hire a designer to make your page look more professional. Something like https://oxide.computer/ but with the points above, as they also make the same mistake of making their home page read like a pitch deck.

                            • latchkey

                              yesterday at 10:11 PM

                              Fantastic unsolicited feedback, I'm definitely taking this to heart!

                              The website is intended to be more like documentation, not a pitch deck or a useless splash page with a contact-us form. I dislike sites like Oxide; I scroll past and don't read or ingest any of the fancy parts. Of course, you're right, this probably needs to be less about me. =)

                              Friction definitely needs to be improved. That part is being worked on right now. Our intention is to be fully self-service, so that you don't have to talk to us at all, unless you want to. Credit card and go.

                              We recently lowered our prices to be competitive with the rest of the market vs. focusing on people who care more about what we offer. We weren't trying to be cheaper than everyone else, we were trying to offer a better service. Lesson learned and pricing adjusted. Streisand effect, I don't like to mention the other players much.

                              Again, thanks!

                          • sundarurfriend

                            yesterday at 11:45 PM

                            > neoclouds

                            For anyone else who hadn't heard of this term:

                            > Neoclouds are startups specializing in AI-specific cloud computing. Unlike their larger competitors, they don’t develop proprietary chips. Instead, they rely heavily on Nvidia’s cutting-edge GPUs to power their operations. By focusing solely on AI workloads, these companies offer specialized solutions tailored to AI developers’ needs.

                            from https://www.tlciscreative.com/the-rise-of-neoclouds-shaping-...

                          • Ringz

                            today at 12:51 AM

                            I need your services in Cape Town South Africa. It’s hard to find good data centers here.

                              • latchkey

                                today at 1:30 AM

                                Rent from us! hello@hotaisle.ai

                            • saagarjha

                              today at 1:11 AM

                              That just moves the infrastructure costs to your cloud bill.

                                • latchkey

                                  today at 1:29 AM

                                  True, but there is so much value that we provide above and beyond just a cloud bill that I think it is worth it. This is way more than racking and stacking commodity servers and providing an SSH login.

                                  It is novel equipment that few have ever used before outside of a relatively small HPC community. It regularly breaks and has issues (bugs) that need industry relationships to manage properly. We've had one server down for over a month now because SMCI can't get their sh/t together to fix it. That's a $250k+, 350 lb paperweight. Good luck to any other small company that wants to negotiate that relationship.

                                  We are offering a very valuable service by enabling easy access to some of the most powerful compute available today. How many people do you think have a good grasp of what it takes to configure RoCEv2 and 8x400G across a cluster of servers? Good luck trying to hire talent that can set that up; they already have jobs.

                                  The capex/opex/complexity involved with deploying this level of gear is huge and only getting larger as the industry shifts to bigger/better/faster (i.e., air cooling is dead). Things are moving so quickly that equipment you purchased a year ago is already out of date (H100 -> H200 is a great example). You're going to have to have a pretty impressive depreciation model to deploy this yourself.

                                  I wouldn't just dismiss this as moving costs around.

                                    • zarathustreal

                                      today at 11:30 AM

                                      wait your competitive advantage is “human friction exists”?

                                      …how do you justify marketing yourself in a system like that?

                                      “In general, people in this vertical have difficulty doing their jobs. Luckily we’ve had drinks with most of them” ……

                          • miohtama

                            yesterday at 9:08 PM

                            I am not an expert here, so I want to ask: what's magical about the 405B number?

                              • daveguy

                                yesterday at 9:18 PM

                                That's the size of the largest, most capable, open source models. Specifically Llama 3.1 has 405B parameters. Deepseek's largest model is 671B parameters.

                                  • mhitza

                                    yesterday at 9:42 PM

                                     Small correction: Llama 3.1 is not an open-source model, but a Llama-3.1-licensed model. Apparently neither is DeepSeek (https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/LIC...), which I had wrongly assumed was open source. Though I never considered using it, so I hadn't checked the license before.

                                      • gunalx

                                        yesterday at 10:58 PM

                                        Both DeepSeek R1 and V3-0324 are MIT licensed.

                                        • Der_Einzige

                                          today at 2:24 AM

                                          You can just ignore the license since the existence of these models is based on piracy at a scale never before seen. Aaron Swartz couldn’t have even imagined violating copyright that hard.

                                          If you live in a glass house, you won't throw stones. No one in the LLM space wants to be litigious.

                                          It's an open secret that DeepSeek used a ton of OpenAI continuations both in pre-training and in distillation. That totally violates OpenAI's TOS. No one cares.

                                            • LoganDark

                                              today at 3:34 AM

                                              > No one in the LLM space wants to be litigious

                                              Except for OpenAI.

                              • Der_Einzige

                                today at 2:22 AM

                                4-bit quants of DeepSeek or Llama 3 405B already fit on those GPUs and purportedly have almost zero loss compared to the full model. Doesn't seem like that big of a deal given this.

                                • danielmarkbruce

                                  yesterday at 7:44 PM

                                  It's... useful right now... it's not a huge unlock in a world where model size, GPU memory size, and precision support are all changing quickly.

                                    • jhj

                                      yesterday at 11:33 PM

                                      Unlike quantization, dimensionality reduction/low-rank approximation, distillation, etc., lossless compression is an always-correct addition to any ML system, as you are computing the same thing you did before; the only questions are whether it is fast enough to avoid substantial bottlenecks and whether the achievable compression ratio is high enough to be useful.

                                      Floating point is just an inefficient use of bits (due to excessive dynamic range), especially during training, so it will always be welcome there. Extreme quantization techniques (some of the <= 4-bit methods, say) also tend to increase entropy in the weights limiting the applicability of lossless compression, so lossless and lossy compression (e.g., quantization) sometimes go against each other.

                                      If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.

                                      • striking

                                        yesterday at 7:51 PM

                                        Is GPU memory size really changing that quickly? For that matter, is model size?

                                          • kadushka

                                            yesterday at 8:07 PM

                                            What's rapidly changing are quantization algorithms, and hardware features to support those algorithms. For example, Blackwell GPUs support dynamic FP4 quantization with group size 16. At that group size it's close to lossless (in terms of accuracy metrics).
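
As a toy illustration of group-wise FP4-style quantization with group size 16 (lossy, unlike DFloat11): the E2M1-like value grid below is the commonly cited FP4 set, but the float scale per group is my own simplification; real hardware formats also constrain how the scales themselves are stored.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)  # E2M1 magnitudes
GROUP = 16

def fp4_quant_dequant(x):
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1] + 1e-12  # one scale per 16 values
    mag = np.abs(g) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)  # snap to nearest representable value
    return (np.sign(g) * FP4_GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
w_hat = fp4_quant_dequant(w)
print("mean relative error:", float(np.abs(w - w_hat).mean() / np.abs(w).mean()))  # small but nonzero
```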

                                            • latchkey

                                              yesterday at 9:25 PM

                                              Both AMD and Nvidia are dumping more and more memory into their GPUs.

                                              MI300X is 192GB HBM3, MI325X is 256GB HBM3e, and MI355X should be 288GB HBM3e (and support FP4/FP6).

                                                • NBJack

                                                  yesterday at 9:58 PM

                                                  On the professional side of things, yes. For consumer-grade GPUs, despite gaming-market trends that would otherwise call for it, the values have stagnated a bit.

                                                    • latchkey

                                                      yesterday at 10:13 PM

                                                      I'm under NDA with AMD and sadly can't mention details, but I can say the future is promising.

                                                        • DrillShopper

                                                          yesterday at 11:04 PM

                                                          I hope AMD cracks the CUDA Problem soon

                                            • danielmarkbruce

                                              yesterday at 8:28 PM

                                              Yes, yes.

                                              Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.

                                              DeepSeek recently released a ~670GB model. A couple of years ago, Falcon's 180GB seemed huge.

                                                • spoaceman7777

                                                  yesterday at 9:19 PM

                                                   I'd assume that, in the context of LLM inference, "recent" generally refers to the Ampere generation and later of GPUs, when the demand for onboard memory went through the roof (as the first truly usable LLMs were trained on A100s).

                                                  We've been stuck with the same general caps on standard GPU memory since then though. Perhaps limited in part because of the generational upgrades happening in the bandwidth of the memory, rather than the capacity.

                                                    • danielmarkbruce

                                                      yesterday at 9:26 PM

                                                      Bandwidth is going up too. "It's not doubling every 18 months and hence it's not moving" isn't a sensible way to view change.

                                                      A one time effective 30% reduction in model size simply isn't going to be some massive unlocker, in theory or in practice.

                                  • loufe

                                    yesterday at 6:46 PM

                                    I'm so grateful to live through such exciting times. I can open HN every two to some exciting new news about ML/transformer models. I really should read more into it, but does llama.cpp use a "custom kernel" per se, with cuBLAS, or is it just making good use of the cuBLAS kernel?

                                      • jonplackett

                                        yesterday at 7:15 PM

                                        It’s funny that you’re missing the time frame from your sentence.

                                        2 weeks? Two months? Two days? Two minutes?

                                        All of the above are true sometimes! Exciting times indeed.

                                          • loufe

                                            today at 5:21 AM

                                            Good catch, I meant every two days! :)

                                    • Animats

                                      yesterday at 9:35 PM

                                      Once this weight format war settles down, hardware can be built to support it. Presumably you want matrix multiply hardware optimized for whatever weight format turns out to be reasonably optimal.

                                        • eoerl

                                          yesterday at 10:53 PM

                                          Optimization is post hoc here: you have to train first to be able to Huffman encode, so it's not a pure format question.

                                      • aseligman

                                        yesterday at 10:06 PM

                                        Some additional context: many real-world agent use cases struggle to balance quality, cost, and performance. This technique can help avoid the tradeoffs that quantization techniques introduce, including unpredictable results while you try to cost-optimize an agent. In some cases the cost savings can be significant using DFloat11, as you squeeze into more affordable GPUs.

                                        * I work with xmad.ai

                                        • yjftsjthsd-h

                                          yesterday at 7:57 PM

                                          > Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.

                                          The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious if it improves tokens/sec even all on GPU, since in my very amateur understanding LLMs tend to be constrained by memory bandwidth?

                                            • brigade

                                              today at 6:18 AM

                                              It does not; the decompression is memory to memory, one tensor at a time, so it’s worse. They claim less than 200 GB/s on an A100, and their benchmarks suggest it’s somewhere between 1.5-4x slower at batch size 1 depending on GPU and model. This overhead of course mostly disappears with a large enough batch size.

                                              Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement. But the A100's raw memory bandwidth is 1.6 TB/s.

                                              • philjohn

                                                yesterday at 8:20 PM

                                                My mental model is saying it might do, much like on slow hard drives DoubleSpace in DOS slightly sped up loading data from disk.

                                                • hnuser123456

                                                  yesterday at 8:58 PM

                                                  If the model is 70% the size, it will be 1/0.7 = 1.43x the speed.

                                              • gitroom

                                                today at 3:57 AM

                                                Pretty cool seeing how fast all this moves - feels like every week there's a new trick or hardware upgrade. I def get nerd sniped by these efficiency improvements lol.

                                                • thund

                                                  yesterday at 11:19 PM

                                                  Is this different than ZipNN? https://arxiv.org/pdf/2411.05239

                                                  I see it mentioned but can’t understand if it’s based on it or different/better…

                                                    • thund

                                                      yesterday at 11:24 PM

                                                      Found it, the news reminded me of this paper https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7...

                                                      • jhj

                                                        yesterday at 11:28 PM

                                                        Not really, it's just adding some data transposition (coalescing individual bytes from the data words together) and an option to use an LZ/dictionary-type compressor to compress redundant things. But an LZ-type compressor doesn't make much sense on NN weights, I think, since they are not as redundant as most text data with many repeats, and the space of possible dictionary matches is pretty small: unless the data is highly sparse, there may not be many repetitions that you can leverage to offset the dictionary overhead.

                                                        If you add an LZ-type compressor and have this be in the critical path for inference, then decompression will be a lot slower. It would be best to fuse decompression with the compute kernels (e.g., a GEMM that performs decompression on each tile before the arithmetic), and the simpler the decompression routine, the easier this will be.
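
A structural sketch of the "fuse decompression with the GEMM" idea, in host-side Python rather than an actual GPU kernel: weights are stored as independently compressed tiles, and each tile is decompressed immediately before its partial product is accumulated, so there is never a full-size decompressed copy of the matrix. zlib stands in for whatever entropy coder a real kernel would use; the tile size and helper names are mine.

```python
import numpy as np
import zlib

TILE = 128

def compress_tiles(W):
    # Store the weight matrix as a dict of independently compressed TILE x TILE blocks.
    K, N = W.shape
    tiles = {(k0, n0): zlib.compress(W[k0:k0 + TILE, n0:n0 + TILE].tobytes())
             for k0 in range(0, K, TILE) for n0 in range(0, N, TILE)}
    return tiles, W.shape, W.dtype

def matmul_compressed(X, tiles, shape, dtype):
    K, N = shape
    out = np.zeros((X.shape[0], N), dtype=np.float32)
    for (k0, n0), blob in tiles.items():
        # Decompress one tile, use it, and let it go.
        tile = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(TILE, TILE)
        out[:, n0:n0 + TILE] += X[:, k0:k0 + TILE] @ tile
    return out

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(512, 256)).astype(np.float32)
X = rng.normal(size=(4, 512)).astype(np.float32)
print(np.allclose(matmul_compressed(X, *compress_tiles(W)), X @ W))  # True: decompression is exact
```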

                                                    • wills_forward

                                                      yesterday at 7:01 PM

                                                      So this could universally decrease the memory requirements of un-quantized LLMs by 30%? Seems big if true.

                                                        • moffkalast

                                                          yesterday at 7:28 PM

                                                          Not as big when Q8 quantization is already considered overkill and cuts it down to 50% (and a flat 2x speed boost without any additional compute overhead, mind you), and the more common Q4KM is more like 30%. Definitely interesting if it can be added to existing quantization, but K-quants already use different precision levels for different layers depending on general perplexity impact, which is similar to the entropy metric they use (e.g., Q6 using a mix of 4 bits and 8 bits). And that's not even considering the calibrated imatrix, which does something conceptually similar to FFT to compress even higher.

                                                            • janalsncm

                                                              yesterday at 7:36 PM

                                                              Quantization is not lossless.

                                                                • danielmarkbruce

                                                                  yesterday at 7:39 PM

                                                                  Nobody really cares if it meets a strict definition of lossless.

                                                                    • moffkalast

                                                                      yesterday at 7:53 PM

                                                                      And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.

                                                                      In fact, there's this funny occurrence where Q4 models on occasion perform better than their fp16 counterparts on benchmarks run with top_k=1, since the outputs are slightly more random and they can less deterministically blunder past the local maximum into a more correct solution.

                                                                        • Der_Einzige

                                                                          today at 2:28 AM

                                                                          We got an oral at ICLR for calling out how shit samplers like top_p and top_k are. Use min_p!
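
For reference, min_p sampling keeps every token whose probability is at least min_p times the top token's probability, then renormalizes and samples. A minimal numpy sketch follows; the 0.1 threshold is just an illustrative default, not a recommendation from the thread.

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, rng=np.random.default_rng()):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()   # threshold scales with how confident the model is
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    return int(rng.choice(len(logits), p=filtered))

logits = np.array([2.0, 1.5, 0.2, -3.0, -5.0])
print([min_p_sample(logits) for _ in range(10)])  # only the plausible tokens ever get picked
```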

                                                                            • moffkalast

                                                                              today at 7:13 AM

                                                                              True yep, I wish more people benchmarked models with more representative sampler settings and then took the average of 5 or 10 responses.

                                                                      • BoorishBears

                                                                        yesterday at 8:12 PM

                                                                        I do? I spend a ton of time post-training models for creative tasks.

                                                                        The effects of model quantization are usually qualified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or even worse they'll be measured on metrics that don't map to anything except themselves like perplexity (https://arxiv.org/pdf/2407.09141)

                                                                        I agree Q8 is strong but I also think the effects of quantization are constantly being underappreciated. People are often talking about how these models perform while fundamentally using 10+ variants of a single model with distinct performance profiles.

                                                                        Even knowing the bits per weight used isn't enough to know how exactly a given quant method is affecting the model: https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs

                                                                          • imtringued

                                                                            today at 9:35 AM

                                                                            If you'd trained your own models, you would be aware of quantization-aware training.

                                                                            • danielmarkbruce

                                                                              yesterday at 8:37 PM

                                                                              "Nobody really cares if it meets a strict definition of lossless" != "quantization can be done haphazardly."

                                                                                • BoorishBears

                                                                                  yesterday at 8:54 PM

                                                                                  If you're trying to really snarkily refer to the article on Dynamic Quants 2.0 and how carefully developed they were: they're comparing their quants to the methodology that 99.99% of quants out there use.

                                                                                  The problem is not that people are making quants "haphazardly", it's that people keep parroting that various quants are "practically lossless" when they actually have absolutely no clue how lossy they are given how application specific the concept is for something as multidimensional as an LLM.

                                                                                  The moment anyone tries a little harder to quantify how lossy they are, we repeatedly find that the answer is "not by any reasonable definition of lossless". Even their example where Q4 is <1% away on MMLU 5-shot is probably massively helped by a calibration dataset that maps to MMLU-style tasks really well, just like constantly using WikiText massively helps models that were trained on... tons of text from Wikipedia.

                                                                                  So unless you're doing your own calibrated quantization with your own dataset (which is not impossible, but also not near common), even their "non-haphazard" method could have a noticeable impact on performance.

                                                                                    • danielmarkbruce

                                                                                      yesterday at 9:22 PM

                                                                                      Wasn't referring to that.

                                                                                      You are saying that people are using quantized models haphazardly and talking about them haphazardly. I'll grant it's not the exact same thing as making them haphazardly, but I think you took the point.

                                                                                      The terms shouldn't be used here. They aren't helpful. You are either getting good results or you are not. It shouldn't be treated differently from further training on dataset d. The weights changed - how much better or worse at task Y did it just get?

                                                                                        • BoorishBears

                                                                                          yesterday at 9:49 PM

                                                                                          The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:

                                                                                          - quality for your specific application

                                                                                          - time to first token

                                                                                          - inter-token latency

                                                                                          - memory usage (varies even for a given bits per weight)

                                                                                          - generation of hardware required to run

                                                                                          Of those the hardest to measure is consistently "quality for your specific application".

                                                                                          It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...

                                                                                          When people are paying multiples more for compute to side-step a problem, language and technology that allows you to erase it from the equation is valid.

                                                                                            • danielmarkbruce

                                                                                              yesterday at 9:58 PM

                                                                                              You say that as though people know these things for the full precision deployment and their use case.

                                                                                        Some have the capability to figure it out and can do it for both full precision and quantized. Most don't and cannot.

                                                                          • kridsdale3

                                                                            yesterday at 7:57 PM

                                                                            That's not true. If there are measurable performance differences.

                                                                              • danielmarkbruce

                                                                                yesterday at 8:33 PM

                                                                                "strict" means something. People, including yourself, only care if there is a practical difference in performance. "this is lossless and that isn't lossless" is a completely useless statement in this realm. In many domains lossy compression is either not tolerated, not legal or not practical.

                                                                                • kadushka

                                                                                  yesterday at 8:09 PM

                                                                                  If you get any accuracy degradation with full 8 bits of precision you're doing it wrong.

                                                                                    • omneity

                                                                                      yesterday at 9:27 PM

                                                                                      Or your model wasn't trained so well (weights are too spiky)

                                                                              • throwaway314155

                                                                                yesterday at 8:09 PM

                                                                                Seems reductive.

                                                                • firefoxd

                                                                  today at 3:35 AM

                                                                  Someone has figured out how to compress images even further with LLMs. They've been promising to publish a white paper since last year: https://getproxyai.com/blog/this-image-is-4KB

                                                                  /s I'll show myself out

                                                                  • jsemrau

                                                                    today at 1:03 AM

                                                                    I still hold the opinion that ternary instead of binary would lead to an even higher degree of compression.

                                                                      • xmasotto

                                                                        today at 1:42 AM

                                                                        The underlying memory is still binary, or were you proposing an entirely new computer architecture with ternary gates?

                                                                  • mountainriver

                                                                    yesterday at 7:32 PM

                                                                    Is it possible to run this on new models? It seems like the code is only for inference, unless I'm misunderstanding.

                                                                    • luotuoshangdui

                                                                      yesterday at 7:53 PM

                                                                      Does it affect speed?

                                                                      • aazo11

                                                                        yesterday at 9:58 PM

                                                                        This is a huge unlock for on-device inference. The download time of larger models makes local inference unusable for non-technical users.

                                                                        • marksimi

                                                                          yesterday at 7:42 PM

                                                                          Time to (dynamically) float

                                                                          • iamnotagenius

                                                                            yesterday at 6:57 PM

                                                                            Interesting, but not exactly practical for a local LLM user, as 4-bit is how LLMs are run locally.

                                                                              • sroussey

                                                                                yesterday at 7:08 PM

                                                                                True, but their research did include running locally on a 5080.

                                                                                The big takeaway, in my opinion, is that their technique for LUTs etc. could also be applied to lossy quants. Say, maybe you get 5-bit accuracy in the size of 4-bit?

                                                                                I don't know, but maybe? Also, their two-stage design might make current quantized kernel designs better.

                                                                                  • spindump8930

                                                                                    yesterday at 8:13 PM

                                                                                    Yes, it could be stacked on quants. It might be that quantized activations already are more "dense" and so they can't be compressed as much (from 16 -> ~11 bits), but certainly possible.

                                                                                      • jasonjmcghee

                                                                                        yesterday at 10:15 PM

                                                                                        I read it similarly - that this is a specific attribute of bfloat16, so the quants folks tend to run on local hardware don't have the same inefficiency to exploit

                                                                                • gojomo

                                                                                  yesterday at 7:28 PM

                                                                                  Some might prefer the fidelity of this method's ~30% savings (weights at ~70% of their original size) over the lossiness of 4-bit quantization's 75% savings.

                                                                                  And, maybe the methods stack for those willing to trade both costs for the smallest representation.

                                                                                    • svachalek

                                                                                      yesterday at 7:39 PM

                                                                                      This is only a 30% savings, which is a cool technical feat but hard to see a use case for.

                                                                                    • ein0p

                                                                                      yesterday at 7:14 PM

                                                                                      Note that this is _way_ slower at the small batch sizes you'd need for interactive use. At batch size 1 this seems to run at 1/3rd the speed of bf16 (so about 1/6th the speed of the fp8 you'd realistically be using), if figure 5 is to be believed. This is actually a pretty impressive feat in itself if you know anything about GPU kernel programming, but it is much slower nevertheless. For this to work at "wire speed" it'd need hardware support, which takes years. Their "baseline" elsewhere in the paper is CPU offloading, which is dog slow and can't be made fast due to the PCIe bottleneck.

                                                                                        • timschmidt

                                                                                          yesterday at 7:27 PM

                                                                                          It's perfectly possible to run LLMs quickly on CPUs. An Epyc or Xeon with 12 memory channels achieves similar memory bandwidth to a 4090, which is the limiting factor. Engineering sample Epycs in kits with motherboard and RAM are available on Aliexpress for reasonable prices even.

                                                                                            • ein0p

                                                                                              yesterday at 7:47 PM

                                                                                              Did I say it wasn't? If your context is short and your model is small, it is possible to run LLMs on high-end CPUs able to support 12 channels of high-spec DDR5 RDIMMs. It's not possible to run them as fast as they'd run on a GPU equipped with HBM, though, nor would it be even remotely as energy efficient. It's also not possible to run LLMs quickly on a CPU if your context is long, because CPUs do not have the requisite FLOPS to process long context quickly.

                                                                                              And before you bring MoE into the conversation: MoE only affects the feedforward part of each transformer block, and the full memory bandwidth and compute savings are only realized at batch size 1, sequence length 1, AKA the most inefficient mode that nobody other than Ollama users uses in practice. Sequence length 8 (common for speculative decoding) could be using up to 8x37B parameters (assuming you want to run DeepSeek, the strongest available open-weights model). A batch size of even 2 with sequence length 8 could use almost all parameters if you're particularly unlucky. Prompt processing will almost certainly use all parameters, and will slam into the FLOPS wall of your EPYC's ALUs.

                                                                                              So can LLMs (with an emphasis on "Large") be run on CPUs? Yes. Are you going to have a good time running them this way? No.

                                                                                                • timschmidt

                                                                                                  yesterday at 9:59 PM

                                                                                                  llamafile contains specific optimizations for prompt processing using AVX512 for dealing with just this issue: https://justine.lol/matmul/ (about a 10x speedup over llama.cpp)

                                                                                                  Somewhere between 8 and 192 cores I'm sure there's enough AVX512 to get the job done. And we've managed to reinvent Intel's Larrabee / Knights concept.

                                                                                                  Sadly, the highly optimized AVX512 kernels of llamafile don't support these exotic floats yet as far as I know.

                                                                                                  Yes, energy efficiency per query will be terrible compared to a hyperscaler. However privacy will be perfect. Flexibility will be higher than other options - as running on the CPU is almost always possible. Even with new algorithms and experimental models.

                                                                                                    • ein0p

                                                                                                      yesterday at 10:15 PM

                                                                                                      At 192 cores you're way better off buying a Mac Studio, though.

                                                                                          • ow5

                                                                                            yesterday at 8:39 PM

                                                                                            Hi! I'm one of the contributors to the paper. We have kernels, not yet released, that can shave decoding latency by >20%.

                                                                                            Also, when we ran streaming experiments with the current kernels, we were a median of ~1.3x slower at inference.

                                                                                              • ein0p

                                                                                                yesterday at 9:34 PM

                                                                                                Thanks for chiming in! How do you explain the top-most graph in Figure 5? Am I misreading it?

                                                                                        • hchja

                                                                                          yesterday at 7:53 PM

                                                                                          This is pretty useless in any case that doesn’t involve BFloat16 models

                                                                                            • spindump8930

                                                                                              yesterday at 8:12 PM

                                                                                              bf16 is the de facto default datatype and distribution format for LLMs, which are then often eagerly quantized by users with more limited hardware. See the recent Llama releases and e.g. the H100 spec sheet (advertised flops and metrics target bf16).

                                                                                              • throwaway314155

                                                                                                yesterday at 8:08 PM

                                                                                                So an increasingly smaller number of cases?

                                                                                            • anticensor

                                                                                              yesterday at 8:45 PM

                                                                                              This is just a VBR mode for neural networks. Not quite useful when inference is already quite slow.

                                                                                                • vessenes

                                                                                                  today at 12:36 AM

                                                                                                  Even presuming this is an accurate summary, the conclusion is not accurate - most local LLM inference users are constantly trading off quality for speed, in that speed drops dramatically once RAM is full. So, if you think of speed at desired quality, this could be very useful.

                                                                                              • Havoc

                                                                                                yesterday at 6:58 PM

                                                                                                I'm guessing by lossless they mean something other than what the word usually means in a compression context?

                                                                                                >achieving near information-optimal compression without any loss of precision

                                                                                                So perhaps more lossless as in didn't lose perplexity/benchmarks?

                                                                                                In my mind lossless is precisely zero bits lost along the way.

                                                                                                  • artemisart

                                                                                                    yesterday at 7:12 PM

                                                                                                    The first sentence of the introduction ends with "we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model" so yes it's lossless.

                                                                                                    • Vendan

                                                                                                      yesterday at 7:03 PM

                                                                                                      information-optimal compression is "the theoretical minimum number of bits needed to represent data without losing any information, based on the data's entropy", so I think they mean the same thing you do

                                                                                                        • brokencode

                                                                                                          yesterday at 7:22 PM

                                                                                                          Yeah, they’re saying that this compression is almost as good as is theoretically possible without losing any information.

                                                                                                      • vintermann

                                                                                                        yesterday at 7:36 PM

                                                                                                        A good example that information, i.e. bits, is only meaningful with respect to an end. If you don't know what the bits in a float will be used for, you can't throw them away; but if the floats are in a function, and you know that certain bits can't affect the output of the function regardless of input, then you can throw those bits away and still have a lossless compression of the function.

                                                                                                        • 8ytecoder

                                                                                                          yesterday at 7:10 PM

                                                                                                          Think Morse code, where frequently used letters have shorter codes than less frequent ones. This ensures zero loss of information.
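
To make the analogy concrete, here is a tiny Huffman coder: frequent symbols get short codes, rare ones get long codes, and decoding reproduces the input exactly. DFloat11 applies this kind of prefix coding to the highly skewed bfloat16 exponents; the toy data below is my own stand-in.

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(data):
    tie = itertools.count()  # unique tie-breaker so the heap never compares the code dicts
    heap = [(n, next(tie), {s: ""}) for s, n in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees...
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))  # ...merged and pushed back
    return heap[0][2]

def encode(data, codes):
    return "".join(codes[s] for s in data)

def decode(bits, codes):
    inverse, out, cur = {v: k for k, v in codes.items()}, [], ""
    for b in bits:
        cur += b
        if cur in inverse:       # prefix-free codes: the first match is the symbol
            out.append(inverse[cur])
            cur = ""
    return out

data = list("aaaaaaaabbbbccd")   # skewed frequencies, like bfloat16 exponent bytes
codes = huffman_codes(data)
bits = encode(data, codes)
assert decode(bits, codes) == data              # bit-for-bit identical round trip
print(codes, len(bits), "coded bits vs", 8 * len(data), "raw bits")
```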

                                                                                                          • ziddoap

                                                                                                            yesterday at 7:28 PM

                                                                                                            The part you quote is a few sentences past the sentence that says "preserving outputs that are bit-for-bit identical to the original model".