
How an inference provider can prove they're not serving a quantized model

58 points - today at 6:53 AM

  • hleszek

    today at 9:22 PM

    Why not allow the user to provide the seed used for the generation? That way we can at least detect whether the model has changed: if the same prompt with the same seed suddenly gives a new answer (assuming they don't cache answers), something is different. You could also compare different providers that supposedly use the same model, and if the model is open-weight you could even compare against a run on your own hardware or on rented GPUs.

      • bthornbury

        today at 9:27 PM

        AFAIK seed determinism can't really be relied upon between two machines, maybe not even between two different GPUs.

          • whatsupdog

            today at 9:57 PM

            That doesn't seem correct. It's just matrix multiplications at the end. Doesn't matter if it's a different computer, GPU or even math on a napkin. Same seed, input and weights should give the same output. Please correct me if I'm wrong.

              • tripplyons

                today at 10:29 PM

                There are many ways to compute the same matrix multiplication that apply the sum reduction in different orders, which can produce different answers when using floating point values. This is because floating point addition is not truly associative because of rounding.
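A minimal sketch of the rounding effect described above, using plain Python floats (IEEE 754 doubles) rather than GPU kernels. The helper `naive_sum` is a hypothetical name for this illustration; it adds strictly left to right, and it is used instead of the built-in `sum` because newer Python versions may apply compensated summation to floats.

```python
def naive_sum(values):
    """Add values strictly left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, different grouping: the two results differ in the last bit.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)

# Same four numbers, different order: the small terms are absorbed or kept
# depending on when they meet the large ones.
values = [1e16, 1.0, -1e16, 1.0]
print(naive_sum(values), naive_sum(sorted(values)))
```

A matrix multiplication is many such sums, so two kernels that reduce in a different order can disagree in the low bits even with identical weights and inputs.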

                • jashulma

                  today at 10:05 PM

                  https://thinkingmachines.ai/blog/defeating-nondeterminism-in... A nice write up explaining how it’s not as simple as it sounds

                  • measurablefunc

                    today at 10:10 PM

                    You're assuming consistent hardware & software profiles. The way these things work at scale is essentially a compiler/instruction scheduling problem where you can think of different CPU/GPU combinations as the pipelines for what is basically a data center scale computer. The function graph is broken up into parts, compiled for different hardware profiles w/ different kernels, & then deployed & stitched together to maximize hardware utilization while minimizing cost. Service providers are not doing this b/c they want to but b/c they want to be profitable so every hardware cycle that is not used for querying or optimization is basically wasted money.

                    You'll never get agreement from any major companies on your proposal b/c that would mean they'd have to provide a real SLA for all of their customers & they'll never agree to that.

                • maxilevi

                  today at 10:04 PM

                  That's not true in practice.

                    • tripplyons

                      today at 10:33 PM

                      It is definitely true across different chips. The best kernel to use will vary with what chip it is running on, which often implies that the underlying operations will be executed in a different order. For example, with floating point addition, adding up the same values in a different order can return a different result because floating point addition is not associative due to rounding.

              • bthornbury

                today at 9:30 PM

                Something like a perplexity/log-likelihood measurement across a large enough number of prompts/tokens might get you the same in a statistical sense though. I expect those comparison percentages at the top are something like that.
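One way to picture the statistical comparison suggested above, as a rough sketch only. It assumes each provider can return per-token natural-log probabilities for the same fixed text; the function names and the idea of reporting a relative gap are illustrative assumptions, not anything taken from the article.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability), for natural-log token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_gap(reference_logprobs, candidate_logprobs):
    """Relative perplexity difference of a candidate provider vs. a reference run."""
    ref = perplexity(reference_logprobs)
    cand = perplexity(candidate_logprobs)
    return (cand - ref) / ref
```

Bitwise equality is not required here: a small gap is expected from kernel differences, while a consistently larger one across many tokens would hint at a different model or quantization.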

            • wongarsu

              today at 8:40 PM

              I'm somehow more convinced by the method shown in the introduction of the article: run a number of evals across model providers, see how they compare. This also catches all other configuration changes an inference provider can make, like KV-cache quantization. And it's easy to understand, talk about, and the threat model is fairly clear (be wary of fixed answers to your benchmark if you're really distrustful)

              Of course conceptually attestation is neat and wastes less compute with repeated benchmarks. It definitely has its place

                • Aurornis

                  today at 9:25 PM

                  This comes up so frequently that I’ve seen at least 3-4 different websites running daily benchmarks on providers and plotting their performance.

                  The last one I bookmarked has already disappeared. I think they’re generally vibe coded by developers who think they’re going to prove something but then realize it’s expensive to spend that money on tokens every day.

                  They also use limited subsets of big benchmarks to keep costs down, which increases the noise of the results. The last time someone linked to one of these sites claiming a decline in quality, it showed a noisy, mostly flat graph that someone had put a regression line on, sloping very slightly downward.
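To make the noise point concrete, here is a back-of-the-envelope sketch. It assumes each benchmark question is an independent pass/fail trial, so accuracy has a binomial standard error; the function names and the 1.96 threshold (roughly a 95% interval) are choices made for this example.

```python
import math

def accuracy_standard_error(accuracy, num_questions):
    """Binomial standard error of an accuracy measured on num_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)

def difference_is_distinguishable(acc_a, acc_b, num_questions, z=1.96):
    """True if two accuracies differ by more than z standard errors of the gap."""
    se_a = accuracy_standard_error(acc_a, num_questions)
    se_b = accuracy_standard_error(acc_b, num_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > z * se_diff
```

Under these assumptions, a two-point drop from 80% to 78% is well inside the noise on a 200-question subset and only becomes distinguishable with thousands of questions.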

              • viraptor

                today at 8:39 PM

                The title here seems very different from the post. All that verification happens locally only. There's no remote validation at any point. So I'm not sure what's the reason to even apply this check. If you're running the model yourself, you know what you're downloading and can check the hash once for transfer problems. Then you can do different things for preventing storage bitrot. But you're not proving anything to your users this way.

                You'd need to run a full, public system image with known attestation keys and return some kind of signed response with every request to do that. Which is not impossible, but the remote part seems to be completely missing from the description.

                  • FrasiertheLion

                    today at 9:06 PM

                    The verification is not happening locally only. The client SDKs fetch the measurement of the weights (+ system software, inference engine) that is pinned to Sigstore, then grab the same measurement (aka remote attestation of the full, public system image) from the running enclave, and check that the two are exactly equal. Our previous blog explains this in more detail: https://tinfoil.sh/blog/2025-01-13-how-tinfoil-builds-trust

                    Sorry it wasn’t clear from the post!

                      • arboles

                        today at 9:34 PM

                        What prevents the provider from sending to the client an attestation of hardware state and actually running another?

                          • viraptor

                            today at 9:55 PM

                            The other comments are correct, but let me try for a different phrasing, because it's a complex topic. You have two parts for attestation: The hardware provides the keys and computation for the measurement state that you can't change as a user. The software provides the extra information/measurements to the hardware.

                            That means you can't simulate the hardware in a way that would allow you to cheat (the keys/method won't match). And you can't replace the software part (the measurements won't match).

                            It all depends on the third party and the hardware keys not leaking, but as long as you can review the software part, you can be sure the validation of the value sent with the response is enough.

                              • arboles

                                today at 10:08 PM

                                I understand hardware attestation at this level. What I'm still working on understanding is why you couldn't route a hardware attestation from a different machine, one that isn't the machine the user cares about.

                                  • viraptor

                                    today at 10:34 PM

                                    Because to obtain the result of attestation, you'd need to actually run the prompt on the verified machine in the first place. (And in practice the signature would be bound to your response as well)

                                    • 3s

                                      today at 10:17 PM

                                      The attestation is tied to the Modelwrap root hash (the root hash is included in the attestation report) so you know that the machine that is serving the model has the right model weights

                              • FrasiertheLion

                                today at 9:46 PM

                                When the enclave boots, two things happen:

                                1. An HPKE (https://www.rfc-editor.org/rfc/rfc9180.html) key is generated. This is the key that encrypts communication to the model.

                                2. The enclave is provisioned with a certificate.

                                The certificate is embedded with the HPKE key accessible only inside the enclave. The code for all this is open source and part of the measurement that is being checked against by the client.

                                So if the provider attempts to send a different attestation or even route to a different enclave, this client side check would fail.
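A simplified sketch of the key-binding idea in the comment above. It assumes the attestation report carries a SHA-256 fingerprint of the enclave's public key; that report layout and the function names are hypothetical, not Tinfoil's actual format.

```python
import hashlib
import hmac

def key_fingerprint(public_key_bytes):
    """SHA-256 fingerprint of a serialized public key, as a hex string."""
    return hashlib.sha256(public_key_bytes).hexdigest()

def channel_is_bound(attested_key_fingerprint, offered_public_key):
    """True only if the key the client is about to encrypt to is the attested one.

    attested_key_fingerprint: hex fingerprint taken from the verified attestation
    offered_public_key: raw bytes of the key presented on this connection
    """
    return hmac.compare_digest(
        attested_key_fingerprint, key_fingerprint(offered_public_key)
    )
```

If a provider relayed an attestation from one enclave while terminating the connection elsewhere, the offered key would not match the attested fingerprint and the check would fail.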

                                  • arboles

                                    today at 10:36 PM

                                    Is this certificate a TLS certificate? At least the TLS connection the user has should be with the "enclave", not a proxy server. If the connection is with a proxy server, the user can be MITM'd.

                                • julesdrean

                                  today at 9:41 PM

                                  The provider cannot choose the attestation that is sent; the hardware assembles the attestation through mechanisms that the provider cannot control. That's why it's called "trusted hardware" technology: you only need to trust the hardware (how it was implemented), and you don't need to trust the provider operating it.

                      • robrenaud

                        today at 10:36 PM

                        Please serve well quantized models.

                        If you can get 99 percent of the quality for 50 percent of the cost, that is most times a good tradeoff.

                        • bthornbury

                          today at 9:23 PM

                          Is modelwrap running on arbitrary clients? I'm not following the whole post, but how are you able to maintain confidence in client-owned hardware/disks following the secure model the method seems to depend on?

                            • FrasiertheLion

                              today at 9:53 PM

                              The disk isn’t client owned, but anyone can run modelwrap on any device and reproduce the root measurement that is being attested against.
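To illustrate what "reproduce the root measurement" can mean, here is a generic Merkle-root sketch over fixed chunks of data. This is an assumption-laden illustration of the general technique, not Modelwrap's actual construction; the chunking and the duplicate-last-node rule are choices made for this example.

```python
import hashlib

def merkle_root(chunks):
    """Return the hex Merkle root (SHA-256) of a non-empty list of byte chunks."""
    if not chunks:
        raise ValueError("need at least one chunk")
    # Leaf level: hash every chunk.
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    while len(level) > 1:
        # Duplicate the last node when the level has an odd number of entries.
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Parent = hash of the two concatenated child hashes.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Anyone holding the same bytes computes the same root, and changing any single chunk changes it, which is what lets a published root stand in for the whole set of weights.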

                          • arcanemachiner

                            today at 8:22 PM

                            Call me an old fuddy-duddy, but my faith in the quality of your reporting really fell through the floor when I saw that the first image showed Spongebob Squarepants swearing at the worst-performing numbers.

                            EDIT: I read through the article, and it's a little over my head, but I'm intrigued. Does this actually work?

                            • rhodey

                              today at 8:35 PM

                              In my opinion this is very well written

                              Two comments so far suggesting otherwise and I guess idk what their deal is

                              Attestation is taking off

                              • LoganDark

                                today at 9:22 PM

                                I don't understand what stops an inference provider from giving you a hash of whatever they want. None of this proves that's what they're running, it only proves they know the correct answer. I can know the correct answer all I want, and then just do something different.

                                  • rhodey

                                    today at 9:31 PM

                                    Attestation always involves a "document" or a "quote" (two names for basically a byte buffer) and a signature from someone. Intel SGX & TDX => signature from Intel. AMD SEV => signature from AMD. AWS Nitro Enclaves => signature from AWS.

                                    Clients who want to talk to a service which has attestation send a nonce, and get back a doc with the nonce in it, and the clients have somewhere in them a hard coded certificate from Intel, AMD, AWS and they check that the doc has a good sig.
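The nonce exchange above can be sketched as follows. Loud caveat: real attestation uses an asymmetric signature that chains to a vendor certificate; this sketch substitutes an HMAC with a shared key purely so it runs on the standard library, and `VENDOR_KEY`, the document layout, and both function names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Stand-in for the vendor's signing key (hypothetical; real schemes use
# asymmetric keys, and the client only holds the vendor's certificate).
VENDOR_KEY = b"stand-in vendor signing key"

def make_attestation_doc(nonce, measurement, key=VENDOR_KEY):
    """Simulate the hardware side: a document (nonce + measurement) and its signature."""
    doc = nonce + measurement
    sig = hmac.new(key, doc, hashlib.sha256).digest()
    return doc, sig

def verify_attestation_doc(doc, sig, expected_nonce, key=VENDOR_KEY):
    """Client side: check the signature first, then that our own nonce is in the doc."""
    expected_sig = hmac.new(key, doc, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected_sig):
        return False
    return hmac.compare_digest(doc[: len(expected_nonce)], expected_nonce)

nonce = secrets.token_bytes(32)  # fresh per connection, so old docs cannot be replayed
doc, sig = make_attestation_doc(nonce, b"example-measurement")
print(verify_attestation_doc(doc, sig, nonce))
```

The nonce is what ties the document to this particular request; a provider replaying a previously recorded document would fail the second check.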

                                      • LoganDark

                                        today at 9:36 PM

                                        Yes, though I see the term abused often enough that it's not enough for me to believe it's sound just from the use of the term attestation. Nowadays "attestation" is simply slang for "validate we can trust [something]". I didn't see any mechanism described in the article to validate that the weights actually being used are the same as the weights that were hashed.

                                        In a real attestation scheme you would do something like have the attesting device generate a hardware-backed key to be used for communications to and from it, to ensure it is not possible to use an attestation of one device to authenticate any other device or a man-in-the-middle. Usually for these devices you can verify the integrity of the hardware-backed key as well. Of course all of this is moot though if you can trick an authorized device into signing or encrypting/decrypting anything attacker-provided, which is where many systems fail.

                                    • FrasiertheLion

                                      today at 9:28 PM

                                      There are a few components that are necessary to make it work:

                                      1. The provider open sources the code running in the enclave and pins the measurement to a transparency log such as Sigstore

                                      2. On each connection, the client SDK fetches the measurement of the code actually running (through a process known as remote attestation)

                                      3. The client checks that the measurement that the provider claimed to be running exactly matches the one fetched at runtime.

                                      We explain this more in a previous blog: https://tinfoil.sh/blog/2025-01-13-how-tinfoil-builds-trust
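The three steps above reduce to a comparison the client performs on every connection. A minimal sketch, assuming both measurements are available as hex strings; `fetch_pinned` and `fetch_attested` are hypothetical callables standing in for the transparency-log lookup and the remote attestation call, not real SDK functions.

```python
import hmac

def measurements_match(pinned_hex, attested_hex):
    """Constant-time comparison of two hex-encoded measurements."""
    return hmac.compare_digest(pinned_hex.lower(), attested_hex.lower())

def check_connection(fetch_pinned, fetch_attested):
    """Refuse to proceed unless the running enclave matches the published measurement."""
    pinned = fetch_pinned()      # what the provider publicly claimed to run
    attested = fetch_attested()  # what the hardware reports is actually running
    if not measurements_match(pinned, attested):
        raise RuntimeError("enclave measurement does not match the pinned value")
    return attested
```

The comparison itself is trivial; the security rests on where the two values come from, which is why the attested value must be verified against the hardware vendor's signature before it is used here.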

                                        • LoganDark

                                          today at 9:29 PM

                                          What enclave are you using? Is it hardware-backed?

                                          Edit: I found https://github.com/tinfoilsh/cvmimage which says AMD SEV-SNP / Intel TDX, which seems almost trustworthy.

                                            • FrasiertheLion

                                              today at 9:31 PM

                                              Yes, we use Intel TDX/AMD SEV-SNP with H200/B200 GPUs configured to run in Nvidia Confidential Computing mode

                                                • LoganDark

                                                  today at 9:34 PM

                                                  I would be interested to see Apple Silicon in the future, given its much stronger isolation and integrity guarantees. But that is an entirely different tech stack.

                                                    • julesdrean

                                                      today at 9:48 PM

                                                      Apple does something very similar with Apple Private Cloud Compute. It's interesting because their isolation argument is different. For instance, memory is not encrypted (so weaker protection against physical attacks), but they measure and guarantee the integrity of (and need to trust) all code running on the machine, not just the code inside the secure enclave.

                                                      A good question is how many lines of code you need to trust at the end of the day across these different designs.

                                                        • LoganDark

                                                          today at 10:44 PM

                                                          Lines of code hardly means anything, but I'd believe Apple has far fewer, given how aggressively they curtail their platforms rather than letting them collect legacy cruft.

                                  • jMyles

                                    today at 9:42 PM

                                    Related but distinct: Is there an ELI5 about determinism in inference? In other words, when will the same prompt lead to the same output, and when not? And why not?

                                      • FrasiertheLion

                                        today at 10:21 PM

                                        jashulma above has a great link: https://news.ycombinator.com/item?id=47105315

                                        • measurablefunc

                                          today at 10:26 PM

                                          Even if you reduce all the non-determinism you still will not get consistent results b/c of floating point rounding & instruction scheduling in the GPU. There is no way to guarantee that the GPU pipelines will execute your instructions exactly in the order you want them to be executed b/c GPUs are now essentially equivalent to sufficiently smart compilers & perform all sorts of clever instruction re-ordering behind the scenes. Expecting complete reproducibility at scale is a pipe dream.

                                      • exceptione

                                        today at 8:27 PM

                                        The idea is that you run a workload at a model provider that might cheat you by altering the model they offer, right? So how does this help? If the provider wants to cheat (they apparently do), wouldn't they be able to swap the modelwrap container, or maybe even do some shenanigans with the filesystem?

                                        I am ignorant about this ecosystem, so I might be missing something obvious.

                                          • FrasiertheLion

                                            today at 9:12 PM

                                            The committed weights are open source and pinned to a transparency log, along with the full system image running in the enclave.

                                            At runtime, the client SDK (also open source: https://docs.tinfoil.sh/sdk/overview) fetches the pinned measurement from Sigstore, and compares it to the attestation from the running enclave, and checks that they’re equal. This previous blog explains it in more detail: https://tinfoil.sh/blog/2025-01-13-how-tinfoil-builds-trust

                                          • cmrx64

                                            today at 9:45 PM

                                            https://hellas.ai is building out their category theoretic compiler and protocol for solving this issue

                                              • tripplyons

                                                today at 10:39 PM

                                                ZKML is a very exciting emerging field, but the math is nowhere near efficient enough to prove an inference result for an LLM yet. They are probably just trying to sell their crypto token.