
Toward automated verification of unreviewed AI-generated code

47 points - yesterday at 10:52 AM

Source
  • artee_49

    today at 8:29 PM

    Unintended side effects are the biggest problem with AI-generated code. I can't think of a proper way to solve that.

    • phailhaus

      today at 5:56 PM

      Using FizzBuzz as your proxy for "unreviewed code" is extremely misleading. It has practically no complexity, it's completely self-contained and easy to verify. In any codebase of even modest complexity, the challenge shifts from "does this produce the correct outputs" to "is this going to let me grow the way I need it to in the future" and thornier questions like "does this have the performance characteristics that I need".

        • loloquwowndueo

          today at 6:16 PM

          > is this going to let me grow the way I need it to in the future

          This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.

          For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.

          Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding

            • lelanthran

              today at 8:03 PM

              > Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding

              You may think you are being sarcastic, but I guarantee that a significant percentage of developers think that both the following are true:

              a) They will never need to write code again, and

              b) They are some special snowflake that will still remain employed.

                • patates

                  today at 8:23 PM

                  I don't agree with your first point. We are surely writing less code, and it will keep getting less and less. At some point it will reduce to a single run function that will make the universe and everything work and it will be called via a button, and that will be the modern definition of writing code: Click the button. Not a lot of keys with weird alphabet thingies on them.

                  You are however right on your second point because I'm damn good at clicking buttons.

              • Swizec

                today at 6:27 PM

                > including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements

                Let me guess, you've never worked in a real production environment?

                When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.

                Really good post about why AI is a poor fit in software environments where nobody even knows the full requirements: https://www.linkedin.com/pulse/production-telemetry-spec-sur...

                  • dctoedt

                    today at 8:03 PM

                    > Let me guess, you've never worked in a real production environment?

                    The comment to which you're responding includes a note at the end that the commenter is being sarcastic. Perhaps that wasn't in the comment when you responded to it.

                      • Swizec

                        today at 8:25 PM

                        It wasn’t, thanks for highlighting. It can be hard to tell online because there are a lot of people genuinely suggesting everyone should build their own software on the fly.

                    • person22

                      today at 6:39 PM

                      I work on a product that meets your criteria. We can't fix a class of defects because once we ship, customers will depend upon that behavior and changing is very expensive and takes years to deprecate and age out. So we are stuck with what we ship and need to be very careful about what we release.

                        • fhd2

                          today at 6:55 PM

                          That's why I find any effort to create specifications... cute. In brownfield software, more often than not, the code _is_ the specification.

                          • patates

                            today at 8:29 PM

                            This might be the "Steve, Don't Eat It!" version of the xkcd workflow comic.

                            Whatever you ship, Steve will eat, and some Steves will develop an addiction.

                        • empath75

                          today at 7:03 PM

                          > When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.

                          Well, now it'll take them 5 minutes to rewrite their code to work around your change.

                            • Swizec

                              today at 7:16 PM

                              > Well, now it'll take them 5 minutes to rewrite their code to work around your change

                              You misunderstand. It will take them 2 years to retrain 5000 people on the new process across hundreds of locations. In some fields, whole new college-level certifications courses will have to be created.

                              In my specific experience it’s just a few dozen (maybe 100) people doing the manual process on top of our software and it takes weeks for everyone to get used to any significant change.

                              We still have people using pages that we deprecated a year ago. Nobody can figure out who they are or what they’re missing on the new pages we built

                              • procaryote

                                today at 7:14 PM

                                That will be after it broke, which costs money

                                Also: no

                        • baq

                          today at 6:26 PM

                          it isn't gospel, it's perspective. if you care about the code, it's obviously bonkers. if you care about the product... code doesn't matter - it's just a means to an end. there's an intersection of both views in places where code actually is the product - the foundational building blocks of today's computing software infrastructure like kernels, low level libraries, cryptography, etc. - but your typical 'uber for cat pictures' saas business cares about none of this.

                        • builtbyzac

                          today at 7:24 PM

                          The revenue-from-cold-start problem is harder than most AI posts acknowledge. I built products and ran distribution for 72 hours with a Claude Code agent — zero sales. Not because the agent couldn't do the work, but because the work (audience, credibility, trust) takes longer than 72 hours. The AI capability problem is mostly solved; the distribution and trust problem isn't.

                            • wordpad

                              today at 7:31 PM

                              > AI capability problem is mostly solved; the distribution and trust problem isn't.

                              SaaS opportunity? Maybe, some sort of marketplace of AI-written applications and services with discovery features?

                      • jryio

                        today at 6:08 PM

                        This is a naïve approach, not just because it uses FizzBuzz, but because it ignores the fundamental complexity of software as a system of abstractions. Testing often involves understanding these abstractions and testing for/against them.

                        Those of us with decades of experience who use coding agents for hours per day have learned that even with extensive context engineering, these models are not magically covering more than 50% of the testing space.

                        If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.

                          • spaceywilly

                            today at 8:05 PM

                            Exactly. The challenge isn’t getting the LLMs to make sure they validate their own code. It’s getting the LLMs to write the correct code in the first place. Adding more and more LLM-generated test code just obfuscates the LLM code even further. I have seen some really wild things where LLM jumps through hoops to get tests to pass, even when they actually should be failing because the logic is wrong.

                            The core of the issue is that LLMs are sycophants: they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest-value activity is in the planning phase.

                            When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.

                            It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.
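
                            A tiny sketch of the failure mode described above (all names invented, not from the thread): a test whose expected value is derived from the code under test can never fail, no matter how wrong the logic is. Only an independently stated oracle catches the bug.

```python
# Illustrative sketch: a self-referential test vs. an independent oracle.
# apply_discount is deliberately buggy; the correct rule would be
# total - total // 10 (i.e., 10% off).
def apply_discount(total: int) -> int:
    return total + total // 10  # bug: adds 10% instead of subtracting

def vacuous_test():
    expected = apply_discount(100)          # "oracle" taken from the code itself
    assert apply_discount(100) == expected  # can never fail

def real_test():
    assert apply_discount(100) == 90        # independently known answer

vacuous_test()  # passes despite the bug
try:
    real_test()
    bug_caught = False
except AssertionError:
    bug_caught = True
assert bug_caught  # only the independent oracle notices the bug
```

                            The vacuous pattern is exactly the hoop-jumping described above: the test suite grows, coverage numbers look fine, and nothing can ever fail.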

                        • agentultra

                          today at 8:22 PM

                          This might work on small, self contained projects.

                          No side effects is a hefty constraint.

                          Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).

                          When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.

                          Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe code a big fix right there? Will you also remember to modify the specifications first?

                          Good on the author for trying. Correctness is hard.
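
                          The temporal property mentioned above ("if we enter the A state then eventually we must enter the B state") can at least be checked over recorded traces; a minimal sketch, with invented event names:

```python
# Minimal trace checker for "every A is eventually followed by B".
# Event names are invented for illustration.
def leads_to(trace, a, b):
    """True if, after each occurrence of `a`, some later `b` appears."""
    pending = False
    for event in trace:
        if event == a:
            pending = True
        elif event == b:
            pending = False
    return not pending

ok = ["payment_started", "retry", "payment_settled", "idle"]
bad = ["payment_started", "retry", "crash"]

assert leads_to(ok, "payment_started", "payment_settled")
assert not leads_to(bad, "payment_started", "payment_settled")
```

                          Note this only judges the finite traces you happened to record; covering all interleavings of multiple processes is exactly what such tests can't do, which is the niche model checkers (e.g. TLA+/TLC) aim at.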

                          • tedivm

                            today at 5:34 PM

                            While I understand why people want to skip code reviews, I think it's an absolute mistake at this point in time. AI coding assistants are great, but I've seen them fail or go down the wrong path often enough (even with things like spec-driven development) that I don't think it's reasonable to skip review. Everything from leftover development paths in production code to improper implementations and security risks: all of those are just as likely to happen with an AI as with a human, and any team that lets humans push to production without a review would absolutely be ridiculed for it.

                            Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.

                              • Lerc

                                today at 6:12 PM

                                I agree that it would be a mistake to use something like this anywhere people depend upon specific behaviour of the software. But the only way we will get to the point where we can do this is by building things that don't quite work and then fixing the problems. Like AI models themselves: where they fail today is on problems they couldn't even begin to attempt a short time ago. Criticism often loses track of the fact that we are still developing this technology. Premature deployment will always be driven by people seeking a first-mover advantage. People need to stay aware of that without criticising the field itself.

                                There is a subset of things where it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results the benchmark is often 'does it look right?' rather than 'is it strictly accurate?'

                            • wei03288

                              today at 8:07 PM

                              The core insight here is right: AI-generated code needs different review approaches than human-written code. Humans make predictable mistakes (typos, off-by-one, forgetting edge cases). AI makes unpredictable mistakes that look plausible on the surface.

                              The trickiest part in my experience isn't catching obviously wrong code - it's catching code that works for all the common cases but fails on edge cases the AI never considered. Traditional code review catches these through reviewer experience, but automated verification needs to somehow enumerate edge cases it hasn't seen.

                              Property-based testing (Hypothesis for Python, fast-check for JS) is probably the closest existing tool for this. Instead of writing specific test cases, you describe properties the code should satisfy, and the framework generates thousands of random inputs. It's remarkably good at finding the kind of boundary bugs AI tends to produce.
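
                              The idea can be sketched without the libraries: generate random inputs and assert properties instead of specific cases (Hypothesis and fast-check add smarter generation plus shrinking of failing inputs). This fizzbuzz is a stand-in, not code from the article:

```python
# Hand-rolled sketch of the property-based idea; Hypothesis/fast-check
# automate the generation and also shrink failing inputs to minimal ones.
import random

def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def check_properties(trials: int = 1000) -> None:
    for _ in range(trials):
        n = random.randint(1, 10**6)
        out = fizzbuzz(n)
        # Divisibility must show up in the output, and only then.
        assert ("Fizz" in out) == (n % 3 == 0), n
        assert ("Buzz" in out) == (n % 5 == 0), n
        # Everything else must round-trip through int().
        if n % 3 and n % 5:
            assert int(out) == n

check_properties()
```

                              The properties never state the expected output for any particular input, which is what lets the framework hammer the boundaries instead of the cases the author thought of.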

                              • jghn

                                today at 5:23 PM

                                I do think that GenAI will lead to a rise in mutation testing, property testing, and fuzzing. But it's worth people keeping in mind that there are reasons why these aren't already ubiquitous. Among other issues, they can be computationally expensive, especially mutation testing.

                                • pron

                                  today at 6:38 PM

                                  > The code must pass property-based tests

                                  Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.

                                  There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.

                                  • sharkjacobs

                                    today at 6:01 PM

                                    I'm having a hard time wrapping my head around how this can scale beyond trivial programs like simplified FizzBuzz.

                                      • hrmtst93837

                                        today at 6:25 PM

                                        People treating this as a scaling problem are skipping the part where verification runs into undecidability fast.

                                        Proving a small pure function is one thing, but once the code touches syscalls, a stateful network protocol, time, randomness, or messy I/O semantics, the work shifts from 'verify the program' to 'model the world well enough that the proof means anything,' and that is where the wheels come off.

                                    • duskdozer

                                      today at 6:31 PM

                                      So are we finally past the stage where people pretend they're actually reading any of the code their LLMs are dumping out?

                                        • fhd2

                                          today at 7:01 PM

                                          Who's "we"?

                                          I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.

                                          But it's arguably faster in the short run. Also cheaper.

                                          So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.

                                          If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.

                                          How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.

                                          • empath75

                                            today at 7:04 PM

                                            In a year people will be complaining about human written code going into production without LLM review.

                                        • jerf

                                          today at 7:57 PM

                                          "However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code."

                                          I would like to put my marker out here as vigorously disagreeing with this. I will quote my post [1] again, which given that this is the third time I've referred to a footnote via link rather suggests this should be lifted out of the footnote:

                                          "It has been lost in AI money-grabbing frenzy but a few years ago we were talking a lot about AIs being “legible”, that they could explain their actions in human-comprehensible terms. “Running code we can examine” is the highest grade of legibility any AI system has produced to date. We should not give that away.

                                          "We will, of course. The Number Must Go Up. We aren’t very good at this sort of thinking.

                                          "But we shouldn’t."

                                          Do not let go of human-readable code. Ask me 20 years ago whether we'd get "unreadable code generation" or "readable code generation" out of AIs and I would have guessed they'd generate completely opaque and unreadable code. Good news! I would have been completely wrong! They in fact produce perfectly readable code. It may be perfectly readable "slop" sometimes, but the slop-ness is a separate issue. Even the slop is still perfectly readable. Don't let go of it.

                                          [1]: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/

                                          • davemp

                                            today at 7:35 PM

                                            So often these AI articles misunderstand or ignore the Test Oracle Problem. Generating correct tests is at least as hard as generating the correct answers (often harder).

                                            I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
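
                                            For a few lucky problems a partial oracle does exist, which is the usual workaround: state properties that any correct output must satisfy instead of the output itself. A sketch for sorting, the classic textbook example (not from the comment):

```python
# Partial-oracle check for a sort routine: no expected output is
# stated, only properties any correct sort must satisfy.
from collections import Counter

def check_sort(sort_fn, xs):
    out = sort_fn(list(xs))
    assert all(a <= b for a, b in zip(out, out[1:]))  # ordered
    assert Counter(out) == Counter(xs)                # same multiset
    return out

assert check_sort(sorted, [3, 1, 2, 1]) == [1, 1, 2, 3]
```

                                            For most business logic no such cheap properties exist, which is the point being made: stating them correctly is as hard as writing the code.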

                                              • boombapoom

                                                today at 7:00 PM

                                                production ready "fizz buzz" code. lol. I can't even continue typing this response.

                                                • Ancalagon

                                                  today at 5:44 PM

                                                  Even with mutation testing doesn’t this still require review of the testing code?

                                                    • Animats

                                                      today at 6:17 PM

                                                      Mutation is a test for the test suite. The question is whether a change to the program is detected by the tests. If it's not, the test suite lacks coverage. That's a high standard for test suites, and requires heavy testing of the obvious.

                                                      But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.

                                                      Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
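
                                                      The mechanic being described fits in a few lines (all names invented): a mutant that survives the suite reveals a coverage gap, and adding the boundary case kills it.

```python
# Toy version of the mutation-testing loop: flip >= to > in the
# program, then ask whether the test suite notices the change.
def discount(qty, mutated=False):
    # Rule: 10 or more items get 10% off. The mutant flips >= to >.
    qualifies = qty > 10 if mutated else qty >= 10
    return 0.9 if qualifies else 1.0

def weak_suite(fn):
    assert fn(5) == 1.0
    assert fn(20) == 0.9

def strong_suite(fn):
    weak_suite(fn)
    assert fn(10) == 0.9  # the boundary case

original = lambda q: discount(q)
mutant = lambda q: discount(q, mutated=True)

weak_suite(original)
weak_suite(mutant)      # mutant survives: the suite has a gap
strong_suite(original)  # still passes on the real program
try:
    strong_suite(mutant)
    killed = False
except AssertionError:
    killed = True
assert killed           # the boundary test kills the mutant
```

                                                      Real tools generate thousands of such mutants automatically and rerun the suite for each, which is where the computational cost mentioned elsewhere in the thread comes from.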

                                                      • jryio

                                                        today at 6:11 PM

                                                        Correct. Where did the engineering go? First it was in code files. Then it went to prompts, followed by context, and then agent harnesses. I think the engineering has gone into architecture and testing now.

                                                        We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.

                                                    • otabdeveloper4

                                                      today at 6:37 PM

                                                      This one is pretty easy!

                                                      Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.

                                                      Bam, no coding required.

                                                      • morpheos137

                                                        today at 6:55 PM

                                                        I think we need to approach provable code.

                                                        • andai

                                                          today at 6:25 PM

                                                          ...in FizzBuzz

                                                          • ventana

                                                            today at 6:29 PM

                                                            I might be missing the point of the article, but from what I understand, the TL;DR is, "cover your code with tests", be it unit tests, functional tests, or mutants.

                                                            Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.

                                                            The article mostly talks about automatic test generation.
