
Speculative KV coding: losslessly compressing KV cache by up to ~4×

125 points - last Thursday at 3:29 PM

  • zozbot234

    today at 9:03 AM

    The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.

      • zozbot234

        today at 3:14 PM

        BTW, I forgot to mention that you can make this work in a way, but only if your model architecture generalizes the context and attention mechanism so that the context is no longer a pure sequence. You could then have a large number of distinct "early" token sequences, each self-contained and not depending on any other tokens, e.g. individual source code files. Later parts of the context would of course depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would previously have been accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.
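As a rough illustration of the attention pattern this comment describes, here is a minimal Python sketch. The function name `build_mask`, the `segment_lengths` and `tail_length` parameters, and the example sizes are all hypothetical; the comment does not specify an implementation. Each early segment attends causally only within itself, while the tail attends causally to everything before it.

```python
def build_mask(segment_lengths, tail_length):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: early segments are laid out back to back, followed by an
    ordinary causal tail that can see every earlier position.
    """
    seg_id = []
    for s, length in enumerate(segment_lengths):
        seg_id += [s] * length
    seg_id += [None] * tail_length  # None marks the ordinary causal tail

    n = len(seg_id)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never look ahead
            if seg_id[i] is None or seg_id[i] == seg_id[j]:
                mask[i][j] = True
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(row.count(True) for row in mask)


# Two self-contained "files" of 2 tokens each, then a 1-token tail.
mask = build_mask([2, 2], 1)
full_causal = build_mask([], 5)
```

Because a segment never looks outside itself, its keys and values do not depend on what precedes it, which is what would make its prefill reusable. The second segment ignoring the first is exactly the lost dependency the comment mentions.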

        • saagarjha

          today at 10:54 AM

          Sure, but any classical attention mechanism is quadratic in context length.

            • zozbot234

              today at 12:10 PM

              But with the KV cache optimization, text generation is only quadratic overall. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
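A back-of-the-envelope count of this argument, as a Python sketch. It counts only (query, key) attention pairs per layer and ignores every other cost, and it assumes, as the comment does, that the draft pass is rerun over the whole prefix at every decode step. The function names are illustrative.

```python
def cached_decode_pairs(n):
    """Total attention pairs for n decode steps with a KV cache.

    Step t scores one new query against t cached keys.
    """
    return sum(t for t in range(1, n + 1))


def recompute_every_step_pairs(n):
    """Total attention pairs if the whole causal prefix is recomputed
    at every step: step t costs t * (t + 1) / 2 pairs.
    """
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))


small_cached = cached_decode_pairs(4)
small_recomputed = recompute_every_step_pairs(4)
```

Under these assumptions the cached total grows as n(n+1)/2 and the recompute-everything total as n(n+1)(n+2)/6, so one is quadratic and the other cubic. A scheme that only reconstructs a slice of the cache per step would sit somewhere in between.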

      • hypfer

        today at 8:18 AM

        TL;DR (and please correct me if I got it wrong):

        A tiny deterministic model predicts the K/V cache, the prediction is compared with the real values, and only the delta is stored in VRAM. Going the other way, you run the predictor again and apply the delta, which gives you the full, correct values while storing just the delta.

        And this works because you're never looking at the whole K/V cache, only ever a slice of it. So you just need a memory buffer the size of that slice.
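The predict-then-store-the-delta idea in this summary can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: `predictor` is a hypothetical stand-in for the small deterministic model, the K/V values are pretended to be single bytes, and `zlib` stands in for whatever entropy coder the real scheme uses.

```python
import random
import zlib


def predictor(tokens):
    """Hypothetical deterministic draft model: maps tokens to guessed
    (byte-quantized) K/V values. Any deterministic function works here."""
    return [(7 * t + 3) % 256 for t in tokens]


def compress(tokens, true_kv):
    """Store only the entropy-coded residual between truth and prediction."""
    predicted = predictor(tokens)
    residual = bytes((v - p) % 256 for v, p in zip(true_kv, predicted))
    return zlib.compress(residual, 9)  # zlib is an assumption, not the paper's coder


def decompress(tokens, blob):
    """Rerun the predictor and add the residual back: exact reconstruction."""
    predicted = predictor(tokens)
    residual = zlib.decompress(blob)
    return [(p + r) % 256 for p, r in zip(predicted, residual)]


# Toy data: the "real" values are the prediction plus small noise.
rng = random.Random(0)
tokens = [rng.randrange(256) for _ in range(4096)]
true_kv = [(p + rng.choice([-1, 0, 0, 0, 1])) % 256 for p in predictor(tokens)]

blob = compress(tokens, true_kv)
restored = decompress(tokens, blob)
```

The round trip is lossless because both sides run the same deterministic predictor. How much is saved depends entirely on how concentrated the residuals are, which in turn depends on how good the predictor is.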

        ___

        If this works out and I've understood it correctly, that would _I think_ mean that a 24GB RTX 4090 could fit 256k of q8 context next to Qwen3.6-27B at IQ4_NL.

        Or, alternatively, something like 208k of context (matching the Claude API limit of 200k in some plans) with a slightly larger quant like UD-Q4_K_XL.

        That would be massive. Especially since the thing has so much compute to spare.

        Though that all depends on the size of the predictor model, I guess?

        • syllogistic

          today at 2:06 PM

          How do these results compare with the Engram-based approach from DeepSeek?

          • ssivark

            today at 11:04 AM

            Note that any cache (e.g. one with LRU eviction) is just a specific speculative model of future usage :-)

            The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.
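To make the "cache as a speculative model" framing concrete, here is a small Python sketch of an LRU cache backed by a cheap computation. The class name `LRUCache` and the `compute` callback are illustrative choices, not something from the thread. Keeping the most recently used entries amounts to betting that they will be requested again; a hit is a correct bet and a miss is a wrong one.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that falls back to a computation on a miss."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, key, compute):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = compute(key)  # the "data" is regenerated by a function
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
        return value


cache = LRUCache(capacity=2)
results = [cache.get(k, lambda x: x * x) for k in [1, 2, 1, 3, 1, 2]]
```

Swapping `compute` for a table lookup would not change the caller at all, which is one way to read the remark that the line between functions and data is blurry.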

              • mycall

                today at 12:28 PM

                Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?

            • 0-_-0

              today at 9:02 AM

              You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it.

                • wongarsu

                  today at 11:04 AM

                  The tradeoff gets better the bigger your primary model is, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model.

                  For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.

                    • 0-_-0

                      today at 2:01 PM

                      This can't be used to save VRAM in practice. To generate a new token with the primary model, you first need to decompress the cache, which involves regenerating the whole sequence from scratch. I.e. generate 1 million tokens with the small model to generate 1 with the large.

              • monster_truck

                today at 10:41 AM

                There is no compression taking place here.

                  • liuliu

                    today at 4:18 PM

                    It is a “research note”. It might not pan out, and you might say it doesn’t deserve the attention on the internet. But it did suggest something that resembles compression; there is just no experiment done for that.

                    • zzzoom

                      today at 4:13 PM

                      Isn't the delta fed to an arithmetic coder?

                      • boutell

                        today at 1:48 PM

                        Isn't that nitpicking? It's a smaller representation of the data, if you have a certain appetite for decompression time. It could conceivably be worth it. I think it would make a great level 2 cache for older chats.

                    • mirekrusin

                      today at 8:38 AM

                      If the “speculative” approach works so well in different contexts, why not make it first class and use it everywhere, possibly recursively?

                        • saagarjha

                          today at 10:55 AM

                          Speculation is only worth it if you can profit from it. Not every context allows this or has a similar idea of what can be speculated.

                            • mirekrusin

                              today at 3:15 PM

                              It works very well on dense models and is, imho, a great alternative to MoE. As verification is cheaper than generation, it could be a fundamental, first-class primitive; you could maybe even recurse on it, do live distillation during inference, etc.

                              MoE is more hardcoded and predetermined; speculation is much more dynamic and malleable after training.

                              This paper actually proposes, as future work, aligning the architecture to aid speculation.

                          • doctorpangloss

                            today at 5:10 PM

                            Multi-token prediction is a good enhancement to training. It isn't necessarily useful for inference. Other speculative decoding like EAGLE is. It is specific to the technology and the authors of these things write about it.

                        • haeseong

                          today at 12:16 PM

                          [dead]

                          • porridgeraisin

                            today at 8:38 AM

                            I have yet to do a "deep dive" into the results, but what a well-written article. An LLM could _never_ write so crisply.