> Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread. Many AI models seem biased toward cutting corners by default when generating code, even when you ask them not to. But a few simple follow-up prompts can address that.
That's more or less all of them: they just generate the likely combinations of tokens, and there is no critical thought involved. If you want to approximate that, review iterations are probably the right way to go about it, and without the full conversation context, so the reviewer doesn't see model output like "I'm doing X because it seems like the correct way to go about Y." and instead starts from a fresh context, which allows for more critical predictions.
Here's what works for me; it can be made into a skill in whatever tool you use:
I would like you to do a review loop!
How this works:
* once implementation is done, all tools must be run and pass: whatever is configured in the project, like Ruff, Oxlint and Oxfmt, depending on the tech stack. Don't run such tools directly; look at package.json or similar project files/configurations/run scripts first. If it's a stack that has compilation, compile the app; if there are tests, run those; just know that you DO NOT generally need to stand up the whole app. If there is a projectlint-rules folder, that means you probably should run ProjectLint as well (local tool, use projectlint --help or projectlint --docs, or better yet, look at whether package.json or README.md have any instructions on how to run it)
* once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code (not each having a different sub-section) and looking for CRITICAL/SERIOUS issues (not nitpicks), with the goal of not missing anything and building consensus
* whatever CRITICAL/SERIOUS issues are found, if you can confirm that they're real and not false positives, you will then fix and remember to run the tools after, after which you will do another review iteration, followed by a fix iteration if needed and so on
* remember that the review and fix loop must END with an iteration of the review agents returning that there are no CRITICAL/SERIOUS issues - you cannot just do fixes and say that there is nothing remaining yourself (and also remember that the reviews are done when all of the tools pass, like when the code is linted and formatted etc.)
* at the end, produce a summary post that has a table, the rows being iterations, with one column for each of the agents (A, B, C) showing FIX/OK and then a column called Iteration summary; the goal is to show how many iterations it took and what was fixed; you can also include text alongside the table as usual
The ProjectLint references might need to be removed (replace them with whatever higher-level linting/architecture tools you have, if any), but that's the overall idea. It does use a LOT of tokens, but there's almost always something to fix. Of course, the problem is that sometimes there will be nitpicks or the fixes themselves won't be fully okay, though in general this trends towards slightly better code, even with something like Opus 4.7.
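For anyone who would rather drive this from a script than from a prompt, here is a minimal sketch of the same control flow. Everything in it is hypothetical: `run_tools`, `reviewers`, `confirm` and `fix` are stand-in callbacks for "run the project's lint/test scripts", "spawn a fresh-context review agent", "check a finding is not a false positive" and "apply fixes". It does not call any real agent API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical reviewer: returns CRITICAL/SERIOUS findings; an empty list means OK.
Reviewer = Callable[[], List[str]]


@dataclass
class Iteration:
    number: int
    verdicts: List[str]  # "FIX" or "OK", one entry per reviewer
    summary: str


def review_loop(
    run_tools: Callable[[], bool],
    reviewers: Sequence[Reviewer],
    confirm: Callable[[str], bool],
    fix: Callable[[List[str]], None],
    max_iterations: int = 5,
) -> List[Iteration]:
    """Review and fix until one review iteration comes back fully clean."""
    history: List[Iteration] = []
    for number in range(1, max_iterations + 1):
        # Reviews only happen once linters, formatters, tests etc. pass.
        if not run_tools():
            raise RuntimeError("tools must pass before a review iteration")
        findings = [reviewer() for reviewer in reviewers]
        verdicts = ["FIX" if found else "OK" for found in findings]
        if not any(findings):
            # The loop may only end on a clean review, never on a fix.
            history.append(Iteration(number, verdicts, "no critical/serious issues"))
            return history
        # De-duplicate across reviewers (order preserved), drop false positives.
        unique = list(dict.fromkeys(issue for found in findings for issue in found))
        confirmed = [issue for issue in unique if confirm(issue)]
        if confirmed:
            fix(confirmed)
            summary = "fixed: " + ", ".join(confirmed)
        else:
            summary = "all findings were false positives"
        history.append(Iteration(number, verdicts, summary))
    raise RuntimeError("no clean review within max_iterations")


def render_summary(history: List[Iteration]) -> str:
    """Render the iterations as a Markdown table with columns A, B, C, ..."""
    count = len(history[0].verdicts)
    names = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:count]
    lines = [
        "| Iteration | " + " | ".join(names) + " | Iteration summary |",
        "|" + " --- |" * (count + 2),
    ]
    for it in history:
        lines.append(f"| {it.number} | " + " | ".join(it.verdicts) + f" | {it.summary} |")
    return "\n".join(lines)
```

The `max_iterations` cap is my own addition, not part of the prompt above; it is there so that a reviewer that keeps producing nitpicks cannot burn tokens forever.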
jillesvangurp
today at 9:21 AM
This can backfire a bit on token usage, where it gets a bit too trigger-happy running expensive things for trivial changes. I tend not to use sub-agents for this reason. I actually manage to cover most of my needs on the $20/month Codex subscription. I might switch to the $200 plan at some point. But right now I just need to be economical, as our company is fairly resource constrained. That's also why I prefer Codex over Claude Code. It seems to get the job done for less money. Another advantage is that it seems to have less need for things like this to be spelled out in this level of detail.
Another thing is that unless you are doing really complicated stuff, you probably don't need the latest models running on high. I'm still on 5.4 medium with codex. I see very little reason to change that.
Part of agentic engineering is figuring out how to be economical with tokens and time. You can sacrifice one for the other of course. But there are diminishing returns as well where spending 10x more doesn't actually get you 10x more quality/results.
KronisLV
today at 11:45 AM
I just have the Anthropic 100 USD Max plan and it's enough for daily work - I sometimes do hit the 5-hour limits by midday, but the weekly ones usually cap out at around 80% or thereabouts, even with this approach. I usually use xhigh, sometimes max; both still result in situations where I need to intervene plenty, even on use cases that aren't that complex (some LLM stuff, mostly web-based CRUD, some light data processing, integrations with Jira and GitLab, processing PDFs and so on, sometimes ML training and geospatial work, like the Sentinel-2 satellite data, nothing crazy).
If I had to pay per token, I'd probably look at DeepSeek. In general it feels like it's a bit early for the technology - either our software methods are wasteful, or the hardware hasn't caught up. To me, it appears that we often need to throw more tokens at these problems, not fewer, since otherwise it's just one-shot slop.
> once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code
I did some evals with a prompt like this a few months ago, when I had some subscription tokens to burn, I think using Opus 4.5. What I found was:
1. Running two subagents was somewhat useful
2. Running three started to get redundant
3. Any more than three was pointless (at least when using the same model)
However, even two were getting like 60% the same results.
Much, much more effective was splitting out into audits through different lenses:
* One looking for security issues
* One looking for whether the task was completed successfully
* One looking for performance issues
* One looking for contract/maintainability issues
* One looking at test coverage
Etc.
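A rough sketch of what "different lenses" could look like in practice: one prompt per lens, each still covering the whole diff. The lens names come from the list above; the instruction wording, the `LENSES` dict and `build_lens_prompts` are made up for illustration and are not the prompts used in the evals.

```python
from typing import Dict

# Lens names follow the list above; the instruction text is illustrative only.
LENSES: Dict[str, str] = {
    "security": "Look only for security issues.",
    "task completion": "Check only whether the task was completed successfully.",
    "performance": "Look only for performance issues.",
    "contract/maintainability": "Look only for contract and maintainability issues.",
    "test coverage": "Look only at test coverage of the changed code.",
}


def build_lens_prompts(diff: str, lenses: Dict[str, str] = LENSES) -> Dict[str, str]:
    """Return one review prompt per lens; every prompt sees the full diff."""
    return {
        name: (
            "You are reviewing a code change. "
            f"{instruction}\n"
            "Report serious issues only, or say OK.\n\n"
            f"{diff}"
        )
        for name, instruction in lenses.items()
    }
```

Each resulting prompt would then go to its own fresh-context sub-agent, so the agents differ in what they look for and not just in sampling noise.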
KronisLV
today at 11:47 AM
You can get reasonably close with fewer, but more agents give a better signal: e.g. if 3/3 flag something as an issue, the outer agent that orchestrates them can treat it as something to give more attention to, whereas if it's just 1/3, then it probably warrants more consideration before being accepted. Ofc more agreement doesn't always mean it's right.
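The 3/3 versus 1/3 idea can be sketched as a simple vote tally that the orchestrating agent (or a script) sorts by agreement. This assumes findings from different agents can be matched by normalized text, which is a big simplification; in practice the orchestrator would have to judge whether two findings describe the same issue. `tally_votes` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally_votes(findings_by_agent: Dict[str, List[str]]) -> List[Tuple[str, int, int]]:
    """Return (issue, votes, total_agents), highest agreement first."""
    total = len(findings_by_agent)
    votes: Counter = Counter()
    for findings in findings_by_agent.values():
        # A set per agent so one agent cannot vote twice for the same issue.
        votes.update({finding.strip().lower() for finding in findings})
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [(issue, count, total) for issue, count in ranked]


findings = {
    "A": ["Missing auth check", "N+1 query"],
    "B": ["missing auth check"],
    "C": ["Missing auth check ", "n+1 query", "Unbounded retry loop"],
}
for issue, count, total in tally_votes(findings):
    print(f"{count}/{total} {issue}")
```

The vote count only orders the findings; each one still needs to be confirmed as real before anything gets fixed.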