
Systematically generating tests that would have caught Anthropic's top‑K bug

80 points - 01/11/2026


ludovicianul
01/14/2026
Fuzzing as a concept is heavily underused in routine testing. People usually focus on positive flows and a few obvious or typical negative ones, because there is almost never time to write exhaustive tests covering every negative and boundary scenario. The good news is that you don't have to: there are now many tools that can generate tests almost exhaustively at every level. The bad news is that they are not widely used.
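To make the idea concrete for a top-K style function, here is a minimal sketch of generated testing using only the Python standard library. The function names (`top_k`, `buggy_top_k`, `find_counterexample`) and the chosen properties are illustrative assumptions, not anything from the linked article or from Anthropic's code; a dedicated tool such as Hypothesis would add smarter generation and shrinking on top of this.

```python
import random

def top_k(values, k):
    """Hypothetical function under test: the k largest values, descending."""
    return sorted(values, reverse=True)[:k]

def buggy_top_k(values, k):
    """Hypothetical bug: values[-0:] is the whole list, so k == 0 misbehaves."""
    return sorted(values)[-k:][::-1]

def violates(values, k, result):
    """Return True if `result` breaks any property a top-k result must satisfy."""
    if len(result) != min(k, len(values)):
        return True  # wrong number of elements
    if any(a < b for a, b in zip(result, result[1:])):
        return True  # not in descending order
    remaining = list(values)
    for x in result:
        if x not in remaining:
            return True  # returned something that was not in the input
        remaining.remove(x)
    # Nothing left behind may be larger than something that was kept.
    return bool(result and remaining and max(remaining) > min(result))

def find_counterexample(fn, trials=500, seed=0):
    """Throw random inputs at `fn`; return the first failing (values, k) or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(0, 20)
        # A narrow value range makes ties, a classic boundary case, common.
        values = [rng.randint(-5, 5) for _ in range(n)]
        k = rng.randint(0, n + 2)  # deliberately includes k == 0 and k > n
        if violates(values, k, fn(values, k)):
            return values, k
    return None
```

Calling `find_counterexample(top_k)` should return `None`, while `find_counterexample(buggy_top_k)` should return an input with `k == 0`, a case that hand-written positive-flow tests tend to skip.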
bonoboTP
01/14/2026
I recently asked Claude Code how to do more thorough testing, described what I had in mind, and it set up Hypothesis and mutmut. The latter is quite cool: it introduces bugs into the code, such as swapping values and relational operators, and checks whether any test catches them. If none does, your tests are probably not thorough enough. Better than just line-coverage checks.
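The mutation idea described above can be sketched in a few lines with the standard-library `ast` module. This is not how mutmut is implemented; it is a toy illustration, and `is_adult`, `weak_suite`, `thorough_suite`, and `mutant_survives` are hypothetical names made up for the example.

```python
import ast

# Hypothetical code under test, kept as source text so it can be mutated.
SOURCE = "def is_adult(age):\n    return age >= 18\n"

# Off-by-one style mutations: flip strict and non-strict comparisons.
BOUNDARY_SWAPS = {ast.GtE: ast.Gt, ast.Gt: ast.GtE,
                  ast.Lt: ast.LtE, ast.LtE: ast.Lt}

class SwapComparison(ast.NodeTransformer):
    """Replace each relational operator with its boundary-shifted twin."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [BOUNDARY_SWAPS.get(type(op), type(op))() for op in node.ops]
        return node

def load(source, mutate=False):
    """Compile `source` (optionally mutated) and return its is_adult function."""
    tree = ast.parse(source)
    if mutate:
        tree = ast.fix_missing_locations(SwapComparison().visit(tree))
    namespace = {}
    exec(compile(tree, "<sketch>", "exec"), namespace)
    return namespace["is_adult"]

def weak_suite(fn):
    # Covers every line, but never probes the boundary at 18.
    return fn(30) is True and fn(5) is False

def thorough_suite(fn):
    return weak_suite(fn) and fn(18) is True and fn(17) is False

def mutant_survives(suite):
    """True if the suite still passes after the bug is injected."""
    assert suite(load(SOURCE)), "suite must pass on the original code"
    return suite(load(SOURCE, mutate=True))
```

Here `mutant_survives(weak_suite)` is `True` even though the weak suite has full line coverage, and `mutant_survives(thorough_suite)` is `False` because the boundary check at 18 kills the mutant. A real tool applies many such mutations one at a time and reports the survivors.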
moron4hire
01/14/2026
Would you call it a K-top Defect Hunter?
ebiederm
01/14/2026
I appreciate that I am not the only one seeing the connection between property based testing and proofs.
I will quibble a little with their characterization of proofs as being more computationally impractical.
Proof verification is cheap. On a good day it is as cheap as type checking, which is itself a kind of proof verification. That said, writing proofs can be tricky.
I am still figuring out what writing proofs requires. Anything beyond what your type system can express currently requires a different set of tools (Rocq, Lean, etc.) than writing asserts and ordinary programs. Plus, writing proofs tends to involve lots of mundane details that can be tedious to write.
So while I agree proofs seem impractical, I don't agree that the reason is computational cost.
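As a small illustration of the "verification is type checking" point, here is a sketch in Lean 4. The examples are chosen for this note and are not from the article; the only library fact assumed is the core lemma `Nat.zero_add`.

```lean
-- Checking this proof is type checking: the kernel evaluates both sides
-- and confirms that `rfl` has the stated type.
example : 2 + 2 = 4 := rfl

-- A statement about every natural number, still accepted by `rfl`
-- because `n + 0` reduces to `n` by the definition of addition.
example (n : Nat) : n + 0 = n := rfl

-- The mirrored statement needs an actual argument. Here we cite the
-- core lemma `Nat.zero_add`; checking the citation is again cheap,
-- while finding or writing such a proof is where the effort goes.
example (n : Nat) : 0 + n = n := Nat.zero_add n
```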
Der_Einzige
01/14/2026
That’s Anthropic's fault for continuing to use top-K, a stone-age-tier shitty sampler. Your own head of mechanistic interpretability invented a better one, called tail free sampling, in 2019.
stephantul
01/14/2026
Using the phrase "without the benefit of hindsight" is interesting. The hardest thing with any technology is knowing when to spend the effort/money on applying it. The real question is: do you want to spend your innovation tokens on things like this? If so, how many? And where?
Not knocking this, just saying that it is easy to claim improvements if you know there are improvements to be had.
