
Show HN: Find the best local LLM for your hardware, ranked by benchmarks

121 points - today at 9:19 AM

  • jordiburgos

    today at 11:12 AM

    This is very helpful too: https://www.canirun.ai/

      • freeCandy

        today at 12:02 PM

        Every browser gives me a different result; I guess I can't blame the site for that. But it should perhaps mention which browser would be the most accurate.

        • embedding-shape

          today at 11:57 AM

          Love that it defaults to the GPU being "NVIDIA GeForce 8800 GTX", a GPU released in 2006 with ~700MB of VRAM...

          The estimates seem far off as well. I took https://www.canirun.ai/model/gpt-oss-120b as an example with an RTX Pro 6000, and every single number is off.

      • pornel

        today at 11:01 AM

        It looks nice. I've been searching for something like this recently, and was frustrated with rankings that lack the latest models or don't clearly distinguish quantizations.

        Showing quality loss per quantization is nice.

        I'd prefer this as a website, since I'd handle running the model with a dedicated inference server anyway.

        It would be nice to see the maximum context length that can fit on top of the baseline.

        I was surprised how much token generation speed tanks when using very long context. 30/s can drop down to 2/s. A single speed metric didn't prepare me for that.

        I was also positively surprised that some models scale well with batch parallelism. I can get a 4x speed improvement by running 8 requests in parallel. But this affects memory requirements, and it doesn't apply to all models and inference engines. It would be nice to show that. Some sites fold it into "what's your workflow", but that's too opaque.

        KV cache quantization also makes a difference for speed, VRAM usage and max usable context.
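
        For anyone curious, here's a rough back-of-the-envelope version of how KV cache quantization and leftover VRAM translate into usable context (the layer/head/dim numbers are purely illustrative, not from any particular model):

            # Rough sketch: KV cache bytes per token, and how much context fits in leftover VRAM.
            # 32 layers / 8 KV heads / head_dim 128 are illustrative, not from a specific model.
            def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
                return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V

            free_vram = 8 * 1024**3  # VRAM left over after the weights, say 8 GiB
            for cache_type, bytes_per_elem in [("f16", 2.0), ("q8_0", 1.0625), ("q4_0", 0.5625)]:
                per_tok = kv_bytes_per_token(32, 8, 128, bytes_per_elem)
                print(f"{cache_type}: {per_tok / 1024:.1f} KiB/token, "
                      f"~{int(free_vram / per_tok):,} tokens fit in 8 GiB")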

        On Apple Silicon, MLX-compatible model builds make a difference, so I'd like the benchmarks to make clear they're based on the fastest implementation.

        Multi-token-prediction is another aspect that may substantially change speed.

        • Bigsy

          today at 10:41 AM

          Brew install is broken

          It seems pretty rubbish, I have to say: it's recommending me loads of Qwen 2.5 models, which are really old, and I'm easily running Qwen 3.5 and 3.6 models on this Mac at decent quants.

            • vachina

              today at 11:49 AM

              AI slop quality software for ya.

              “I release software now, good luck everyone”

          • sleepyeldrazi

            today at 11:03 AM

            I love this community. I started building a simple website for exactly this a couple of hours ago, and you've already made an even more advanced version. Hats off to you, sir.

            If I ever decide to actually publish the site, is it alright if I mention you somewhere, as in "If you want a more accurate estimation, check out this project: <your repo>"? I think there is value in having a simple website estimate this information for you and give you instructions / common flags on how to start it yourself (plus a prompt crafted for you to optionally give to an LLM to set it up for you), but I'm going off a simple "choose an OS, GPU/VRAM, here's a list of options" flow and not actually scanning (which is a lot more accurate).

            • llagerlof

              today at 10:52 AM

              What's new here compared to llmfit?

              https://github.com/AlexsJones/llmfit

                • rvz

                  today at 10:55 AM

                  Other than it (whichllm) being written in Python, nothing.

                  I just use llmfit.

              • Jasssss

                today at 10:27 AM

                The plan command is clever. How do you handle the VRAM estimation for models with sliding window attention vs full context? Something like Mistral at 32k context uses way less KV cache than Llama at the same context length, but from the README it looks like the estimation is based on a fixed context size. Does it account for that?
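
                For reference, a minimal sketch of the adjustment I mean (the 4k window and the layer/head numbers are illustrative assumptions, not values from the README):

                    # Sketch: full attention caches the whole context; sliding-window attention
                    # only caches up to the window size per layer. All figures are illustrative.
                    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, window=None, bytes_per_elem=2):
                        cached = ctx if window is None else min(ctx, window)
                        return 2 * n_layers * n_kv_heads * head_dim * cached * bytes_per_elem / 1024**3

                    ctx = 32_768
                    print("full attention :", kv_cache_gib(32, 8, 128, ctx))               # all 32k tokens cached
                    print("sliding window :", kv_cache_gib(32, 8, 128, ctx, window=4096))  # at most 4k cached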

                • kramit1288

                  today at 10:45 AM

                  Accurate memory estimation is key here; it will crash if that isn't accurate. And it can't be generic for all local LLMs, since each local LLM has different context estimates.

                  • cyanydeez

                    today at 11:42 AM

                    This doesn't correctly detect the unified memory architecture for:

                    GPU 0: STRXLGEN — 8.0 GB (ROCm 6.19.8-200.fc43.x86_64) — BW: N/A
                    CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S — 16 cores (AVX2, AVX-512)

                    The 8 GB is the reserved memory, but it's not the total memory available to the GPU.

                    Linux handles the unified memory allocation like this: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
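
                    For what it's worth, amdgpu exposes both pools separately in sysfs, so something like this (just a sketch; card0 and the paths may differ per machine) shows the GTT budget on top of the carve-out:

                        # Sketch: read amdgpu's dedicated carve-out vs GTT (system RAM the GPU can map).
                        # Assumes the APU is card0; the index differs on multi-GPU machines.
                        from pathlib import Path

                        dev = Path("/sys/class/drm/card0/device")
                        vram = int((dev / "mem_info_vram_total").read_text())  # reserved carve-out (the 8 GB)
                        gtt = int((dev / "mem_info_gtt_total").read_text())    # system RAM the GPU can also map
                        print(f"VRAM carve-out: {vram / 1024**3:.1f} GiB, GTT: {gtt / 1024**3:.1f} GiB")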

                    Don't feel bad though, nvtop doesn't do it correctly either.

                    • macwhisperer

                      today at 10:53 AM

                      Can you add in the other quants, like IQ3_M?

                      Also, my personal simple rule of thumb for local AI sizing is:

                      max model size (GB) = ram (GB) / 1.65
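
                      e.g. applying that rule to a few common RAM sizes:

                          # the rule of thumb above: max model size (GB) = RAM (GB) / 1.65
                          for ram_gb in (16, 32, 64, 128):
                              print(f"{ram_gb} GB RAM -> max model ~{ram_gb / 1.65:.1f} GB")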

                      • pbronez

                        today at 11:16 AM

                        Cool, but it looks like it doesn’t actually test anything on your machine? It does hardware detection and then some lookups. Maybe I missed it but I really want a tool like this to actually run a model on my machine to get the speed numbers.

                        I've been using RapidMLX for this. The integrated speed tests matter because the quality of the backend is a moving target, and the quantization / MLX format conversion also matters. It's not enough to say "oh, use this model family with X parameters"; you have to add the architecture-specific quantization too.

                        https://github.com/raullenchai/Rapid-MLX
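
                        Even a crude timing loop goes a long way. A sketch of what I mean, assuming llama-cpp-python and a local GGUF file (the path and settings are placeholders):

                            # Crude tokens/sec check; the model path and settings below are placeholders.
                            import time
                            from llama_cpp import Llama

                            llm = Llama(model_path="model.gguf", n_ctx=4096, n_gpu_layers=-1, verbose=False)
                            start = time.perf_counter()
                            out = llm("Explain KV caching in one paragraph.", max_tokens=256)
                            elapsed = time.perf_counter() - start
                            tokens = out["usage"]["completion_tokens"]
                            # elapsed includes prompt processing, so this slightly understates generation speed
                            print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")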

                        • hacker_mar

                          today at 11:13 AM

                          [flagged]