Constraint Decay: The Fragility of LLM Agents in Back End Code Generation

127 points - today at 12:55 PM

jdlshore
today at 2:49 PM
“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”
One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
siliconc0w
today at 8:09 PM
I recommend spending some time getting a few parts of the codebase idiomatic and then @-ing those files as exemplars. This works a lot better than trying to steer it with markdown. It works reasonably well for something like FastAPI, but JavaScript seems to be the worst: even with guidance and exemplars it'll prefer in-lining a bunch of garbage rather than using the APIs as directed.
KronisLV
today at 8:12 PM
> For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.
Time to start writing linting tools that check the architecture and spoon feed the LLM what exactly it's doing wrong.
I reckon something like this would be good for every project out there: https://www.archunit.org/getting-started
They expand a bit more on the reasoning behind it: https://www.archunit.org/motivation
(I also wrote a simple linter for architecture/code checks that aren't well covered by linters that focus on individual files. It uses Go + goja so rules can be written in ECMAScript, parallelizes the read-only rules, and also allows rules that change files as necessary. It runs in addition to something like Ruff / Oxlint / Oxfmt / whatever is present in each stack, though it's still in development and not as good of a focused example as ArchUnit is.)
If we write software specification docs, bother describing how it evolves with ADRs, enforce code style automatically and require certain test coverage automatically (or at least should), why couldn't we go a step further, formalize those specs and ensure that any new code is also up to snuff? I don't think that's any more of a job for an LLM, than telling it how it should format code is. Also, I'm in the camp that believes that at least many of your ORM mappings and similar stuff should be the output of codegen, since you've already gone through the trouble of describing the schema/migrations to get there.
I don't think this would be only good for LLMs, though - I've seen projects that have like 3 different audit systems built in, not because of some fancy business requirement, but rather cause the devs either didn't know about the previous one(s) or just didn't feel like following what should have been the pre-established conventions, even when there were docs in place (nobody read those).
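A minimal sketch of the kind of architecture lint described above, written in Python with the standard `ast` module rather than ArchUnit itself. The package names (`app.api`, `app.db`) and the `FORBIDDEN_IMPORTS` rule table are hypothetical, chosen only for illustration; the point is that each violation comes back as a concrete message that can be fed to the agent.

```python
import ast

# Hypothetical layering rule: modules under the key package
# may not import from any of the listed packages.
FORBIDDEN_IMPORTS = {
    "app.api": ("app.db",),
}


def imported_modules(source: str) -> list[tuple[int, str]]:
    """Return (line number, module name) for every absolute import."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.extend((node.lineno, alias.name) for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.append((node.lineno, node.module))
    return sorted(found)


def in_package(module: str, package: str) -> bool:
    """True if module is the package itself or lives below it."""
    return module == package or module.startswith(package + ".")


def check_layering(module_name: str, source: str) -> list[str]:
    """Return one readable message per forbidden import in the module."""
    violations = []
    for layer, banned in FORBIDDEN_IMPORTS.items():
        if not in_package(module_name, layer):
            continue
        for lineno, imported in imported_modules(source):
            for prefix in banned:
                if in_package(imported, prefix):
                    violations.append(
                        f"{module_name}:{lineno}: {layer} must not import "
                        f"{prefix} (found '{imported}')"
                    )
    return violations
```

Running `check_layering` over every changed file in CI, or as a tool the agent calls after each edit, gives the "spoon feeding" loop: the message names the file, the line, and the rule that was broken.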
maxbond
today at 2:46 PM
Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline most LLMs can perform long horizon tasks on without accumulating errors & corrupting the document.
I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.
[1] https://arxiv.org/abs/2604.15597
Discussion: https://news.ycombinator.com/item?id=48073246
vishvananda
today at 5:20 PM
I've been experimenting quite a bit with long-horizon agentic coding[1] and I have also noticed that agents seem to perform worse when forced into certain architectural patterns. I have found that it is a bit better when including the constraints along the way instead of adding them after the fact. There seems to be a side-effect I have been calling "calcification", where a pattern starts appearing in the codebase and the agent follows the pattern to the point where it dominates the context and becomes self-reinforcing. This could potentially be a strength or a weakness for existing code bases depending on the codebase quality. I will have more insights on this soon as more from-scratch runs conclude that include architectural guidance from the beginning.
[1]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...
dwa3592
today at 3:58 PM
This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window because at the end, the output would not respect the constraints (or guardrails), but to reliably produce production grade code you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories - and make this <insert ticket> change" - but keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).
p0w3n3d
today at 3:35 PM
> tasks spanning eight web frameworks
Does anyone else have the experience that LLMs create better pure HTML+CSS+JS than they do when working with existing frameworks?
rrook
today at 7:08 PM
As a codebase grows, divergent structure emerging from incidental (language and library) details results in prolonged complexity costs. I'm working on a language that enforces structure for agents: https://github.com/hale-lang/hale
pianopatrick
today at 7:08 PM
I think someone is going to figure out a framework for using LLMs for coding.
A framework would use static code checking tools to force an architecture on to LLMs instead of trying to do so in markdown.
I don't know exactly what it will look like but for example I could imagine a Java Framework where the LLM could only create subclasses of certain classes.
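A rough sketch of the "only subclasses of certain classes" idea, in Python rather than Java to keep the example short. The base class names in `ALLOWED_BASES` are hypothetical, and the check is purely syntactic: it looks at the names written in the class header and does not resolve imports or inheritance chains.

```python
import ast

# Hypothetical framework base classes an agent is allowed to extend.
ALLOWED_BASES = {"Controller", "Service", "Repository"}


def base_name(node: ast.expr) -> str:
    """Best-effort name of a base class expression such as Base or pkg.Base."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return ""


def unapproved_classes(source: str) -> list[str]:
    """Return names of classes that do not extend any allowed base class."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            bases = {base_name(b) for b in node.bases}
            if not bases & ALLOWED_BASES:
                bad.append(node.name)
    return sorted(bad)
```

A statically typed language could push part of this into the compiler (sealed hierarchies, package visibility), but a standalone check like this one can reject a patch before it is even compiled.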
AmazingTurtle
today at 7:11 PM
So my finding is: planning is worth it.
For even slightly complex changes, I always run codex (5.5-high) in planning mode first. I have linked various docs/{ARCHITECTURE,BACKEND-GUIDELINES,NESTJS-DI,..}.md etc. from AGENTS.md so the agent can quickly discover relevant docs at planning time, only if they are needed. No need to know React-specific stuff when it's dealing with a backend problem, for example. I typically blindly approve plans made by the agent with a fresh context, because that's as if I had prompted it. Works the best for me.
Using /goal however, it's really just constantly compacting and doing its thing, so of course it gets sloppy. If only there was a state machine that would transform tickets into a Planning Mode Prompt, then use, idk, guardian approvals (somehow a "Product Management Perspective Lens" approving or making changes to the plan), and then let a less capable or less reasoning agent execute the plan. I think that would work the best.
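The ticket-to-plan-to-approval-to-execution flow wished for above could be sketched as a small state machine. Everything here is an assumption made for illustration: the `Stage` names, the `Ticket` fields, and the `planner`, `reviewer` and `executor` callables, which stand in for whatever agents or humans fill those roles. No real agent API is being called.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    PLANNING = auto()
    REVIEW = auto()
    EXECUTING = auto()
    DONE = auto()


@dataclass
class Ticket:
    description: str
    plan: str = ""
    result: str = ""
    stage: Stage = Stage.PLANNING
    history: list[Stage] = field(default_factory=list)


def run_pipeline(ticket, planner, reviewer, executor, max_revisions=3):
    """Drive a ticket through plan -> review -> execute.

    planner(description) -> plan text
    reviewer(plan) -> (approved, feedback)
    executor(plan) -> result text
    """
    revisions = 0
    while ticket.stage is not Stage.DONE:
        ticket.history.append(ticket.stage)
        if ticket.stage is Stage.PLANNING:
            ticket.plan = planner(ticket.description)
            ticket.stage = Stage.REVIEW
        elif ticket.stage is Stage.REVIEW:
            approved, feedback = reviewer(ticket.plan)
            if approved:
                ticket.stage = Stage.EXECUTING
            else:
                revisions += 1
                if revisions > max_revisions:
                    raise RuntimeError("plan rejected too many times")
                # Feed the reviewer's objections into the next planning pass.
                ticket.description += f"\nReviewer feedback: {feedback}"
                ticket.stage = Stage.PLANNING
        elif ticket.stage is Stage.EXECUTING:
            ticket.result = executor(ticket.plan)
            ticket.stage = Stage.DONE
    return ticket
```

Because each stage takes plain text in and returns plain text, every call can start from a fresh context, and the executor can be a cheaper model than the planner.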
gkfasdfasdf
today at 2:52 PM
Odd that they used GPT-5.2 and not GPT-5.2-codex, i.e. the one optimized for coding agent tasks.
yomismoaqui
today at 3:16 PM
Also they used languages with dynamic typing like Python & JS. In my experience a statically typed codebase is easier to maintain for humans so maybe it is also for agents.
When using Codex/Claude Code with Go code, I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.
bob1029
today at 3:54 PM
> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.
I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.
The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
leecommamichael
today at 3:28 PM
These things don’t think. We’re going to have to reiterate this for a long time, I fear.
rbbydotdev
today at 3:42 PM
This is interesting. Anecdotally, in a recent TypeScript project I felt like I was having better luck with raw SQLite queries than with an ORM (Drizzle).
spacedoutman
today at 5:12 PM
This research is useless and nearly all other LLM research is too.
gpt 5.2 is the strongest model they tested, a nearly 6 month old model.
Traditional research can not keep up.
oulipo2
today at 4:37 PM
Exactly why you can't remove humans in the loop to assess that the solution is not only correct (which LLMs are quite bad at, once concurrency, logic, etc are involved), but also elegant, maintainable, etc
phrotoma
today at 4:34 PM
"Constraint decay": isn't this just another name for the (already well understood) idea of "context rot"?
volume_tech
today at 1:03 PM
[flagged]
