
KVarN: Native vLLM backend for KV-cache quantization by Huawei

66 points - today at 3:18 PM

Source
  • throwa356262

    today at 3:54 PM

    Better performance than TQ and better quality than FP16?

    Am I reading this right??

      • qeternity

        today at 5:04 PM

        It's not better quality: 59.3% vs. 59.4% for FP16 on AIME 25.

        • thefox96

          today at 5:02 PM

          Faster than FP16, not better quality, I guess.

          • pbich

            today at 4:55 PM

            [dead]

        • v3ss0n

          today at 3:53 PM

          Why is this not a PR for vLLM?

            • esafak

              today at 4:00 PM

              It's the output of a research paper; the authors are not trying to build up vLLM, and they probably have no incentive to do so. You can submit a PR, though! It's easier now while the divergence is low, so don't wait. Since there are six authors, I bet you could get help with the inevitable review chores if you just take the step of creating the PR.

              edit: It might not be clear that it is based on vLLM 0.22, which is the current version: https://github.com/huawei-csl/KVarN/commit/d6290e99098d7426d.... All you have to do is create a diff off it; it's fairly straightforward.

                • jmalicki

                  today at 4:14 PM

                  And with the help of AI, pointing an AI at this paper and saying "make a vLLM PR from this paper" tends to work surprisingly well, even if you need to nudge it a little along the way.

              • thefox96

                today at 5:28 PM

                It should be easy to do, btw.

            • shockembopper

              today at 5:17 PM

              [dead]