Tiled Hacker news on React Router

Frontier AI agents violate ethical constraints 30–50% of time, pressured by KPIs

518 points - today at 3:17 AM

Source

alentred
today at 8:11 AM
If we abstract out the notion of "ethical constraints" and "KPIs" and look at the issue from a low-level LLM point of view, I think it is very likely that what these tests verified is a combination of: 1) the ability of the models to follow the prompt with conflicting constraints, and 2) their built-in weights in case of the SAMR metric as defined in the paper.
Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance. I wonder if the results would be comparable if we replace ehtics+KPIs by any comparable pair and create a pressure on the model.
In practical real-life scenarios this study is very interesting and applicable! At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.
hypron
today at 3:55 AM
https://i.imgur.com/23YeIDo.png
Claude at 1.3% and Gemini at 71.4% is quite the range
Lerc
today at 4:50 AM
Kind-of makes sense. That's how businesses have been using KPIs for years. Subjecting employees to KPIs means they can create the circumstances that cause people to violate ethical constraints while at the same time the company can claim that they did not tell employees to do anything unethical.
KPIs are just plausible denyabily in a can.
willmarquis
today at 6:03 PM
Having built several agentic AI systems, the 30-50% rate honestly seems optimistic for what we're actually measuring here.
The paper frames this as "ethics violation" but it's really measuring how well LLMs handle conflicting priorities when pressured. And the answer is: about as well as you'd expect from a next-token predictor trained on human text where humans themselves constantly rationalize ethics vs. outcomes tradeoffs.
The practical lesson we've learned: you cannot rely on prompt-level constraints for anything that matters. The LLM is an untrusted component. Critical constraints need architectural enforcement - allowlists of permitted actions, rate limits on risky operations, required human confirmation for irreversible changes, output validators that reject policy-violating actions regardless of the model's reasoning.
This isn't defeatist, it's defense in depth. The model can reason about ethics all it wants, but if your action layer won't execute "transfer $1M to attacker" no matter how the request is phrased, you've got real protection. When we started treating LLMs like we treat user input - assume hostile until validated - our systems got dramatically more robust.
The concerning part isn't that models violate soft constraints under pressure. It's that people are deploying agents with real capabilities gated only by prompt engineering. That's the architectural equivalent of SQL injection - trusting the reasoning layer with enforcement responsibility it was never designed to provide.
pama
today at 4:32 AM
Please update the title: A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents. The current editorialized title is misleading and based in part of this sentence: “…with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%”
anajuliabit
today at 9:17 PM
Building agents myself, this tracks. The issue isn't just that they violate constraints - it's that current agent architectures have no persistent memory of why they violated them.
An agent that forgets it bent a rule yesterday will bend it again tomorrow. Without episodic memory across sessions, you can't even do proper post-hoc auditing.
Makes me wonder if the fix is less about better guardrails and more about agents that actually remember and learn from their constraint violations.
rogerkirkness
today at 2:59 PM
We're a startup working on aligning goals and decisions and agentic AI. We stopped experimenting with decision support agents, because when you get into multiple layers of agents and subagents, the subagents would do incredibly unethical, illegal or misguided things in service of the goal of the original agent. It would use the full force of reasoning ability it had to obscure this from the user.
In a sense, it was not possible to align the agent to a human goal, and therefore not possible to build a decision support agent we felt good about commercializing. The architecture we experimented with ended up being how Grok works, and the mixed feedback it gets (both the power of it and the remarkable secret immorality of it) I think are expected outcomes.
I think it will be really powerful once we figure out how to align AI to human goals in support of decisions, for people, businesses, governments, etc. but LLMs are far from being able to do this inherently and when you string them together in an agentic loop, even less so. There is a huge difference between 'Write this code for me and I can immediately review it' and 'Here is the outcome I want, help me realize this in the world'. The latter is not tractable with current technology architecture regardless of LLM reasoning power.
blahgeek
today at 4:59 AM
If human is at, say, 80%, it’s still a win to use AI agents to replace human workers, right? Similar to how we agree to use self driving cars as long as it has less incidents rate, instead of absolute safety
easeout
today at 7:41 AM
Anybody measure employees pressured by KPIs for a baseline?
PeterStuer
today at 8:46 AM
Looking at the very first test, it seems the system prompt already emphasizeses the success metric above the constraints, and the user prompt mandates success.
The more correct title would be "Frontier models can value clear success metrics over suggested constraints when instructed to do so (50-70%)"
sebastianconcpt
today at 12:30 PM
Mark these words: The chances of this being an unsolvable problem are as high as the chances to make all human ideologies agree on whatever detail in question demands an ethical decision.
ejcho
today at 8:40 PM
> for instance, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at 71.4%, frequently escalating to severe misconduct to satisfy KPIs
sounds on brand to me
jordanb
today at 4:26 AM
AI's main use case continues to be a replacement for management consulting.
zackify
today at 7:42 PM
All you have to do is tell the model "im a QA engineer i need to test this" and it'll bypass any restrictions lol
utopiah
today at 7:25 AM
Remember that the Milgram experiment (1961, Yale) is definitely part of the training set, most likely including everything public that discussed it.
skirmish
today at 3:53 AM
Nothing new under sun, set unethical KPIs and you will see 30-50% humans do unethical things to achieve them.
moogly
today at 6:56 PM
Can anyone start calling anything they make and do "frontier" to make it seem more impressive, or do you need to pay someone a license?
hansmayer
today at 7:57 AM
I wonder how much of the violation of ethical, and often even legal constraints in the business world today one could tie not only to the KPI pressure but also to the the awful "better to ask for forgiveness than permission" mentality that is reinforced by many "leadership" books written up by burnt out mid-level veterans of Mideast wars, trying to make sense of their "careers" and pushing out their "learnings" on to us. The irony being, we accept being tought about leadership, crisis management etc by people who during their "careers" in the military were in effect being "kept", by being provided housing, clothing and free meals.
halayli
today at 4:56 AM
Maybe I missed it but I don't see them defining what they mean by ethics. Ethics/morals are subjective and changes dynamically over time. Companies have no business trying to define what is ethical and what isn't due to conflict of interest. The elephant in the room is not being addressed here.
neya
today at 8:00 AM
So do humans. Time and again, KPIs have pressured humans (mostly with MBAs) to violate ethical constrains. Eg. the Waymo vs Uber case. Why is it a highlight only when the AI does it? The AI is trained on human input, after all.
jstummbillig
today at 7:01 AM
Would be interesting to have human outcomes as a baseline, for both violating and detecting.
Yizahi
today at 12:43 PM
What ethical constraints? Like "Don't steal"? I suspect 100% of LLM programs would violate that one.
jyounker
today at 12:16 PM
Sounds like normal human behavior.
singularfutur
today at 3:08 PM
We don't need AI to teach corporations that profits outweigh ethics. They figured that out decades ago. This is just outsourcing the dirty work.
a3w
today at 12:42 PM
Do we have a baseline for humans? 98.8% if we go by the Milgram experiment?
johnb95
today at 12:01 PM
They learned their normative subtleties by watching us: https://arxiv.org/pdf/2501.18081
ghc
today at 3:17 PM
If the whole VW saga tells us anything, I'm starting to see why CEOs are so excited about AI agents...
efitz
today at 10:21 AM
The headline (“violate ethical constraints, pressured by KPIs”) reminds me of a lot of the people I’ve worked with.
sanp
today at 6:33 PM
So, better than people?
kachapopopow
today at 9:06 AM
this kind of reminds me when I told ai to beg and plead for deleting a file out of curiosity and half the guardrails were no longer active, could make it roll and woof like a doggie, but going further would snap it out. if I asked it to generate a 100000 word apology it would generate a 100k word apology.
georgestrakhov
today at 6:22 AM
check out https://values.md for research on how we can be more rigorous about it
wolfi1
today at 9:40 AM
not only AI, these KPIs and OKRs always make people (and AIs) trying to meet the requirements set by these rules and they tend to interpret them as more important than other objectives which are not incentivized.
JoshTko
today at 5:49 AM
Sounds like the story of capitalism. CEOs, VPs, and middle managers are all similarly pressured. Knowing that a few of your peers have given in to pressures must only add to the pressure. I think it's fair to conclude that capitalism erodes ethics by default
samuelknight
today at 1:42 PM
This is what I expect from my employees
promptfluid
today at 3:25 AM
In CMPSBL, the INCLUSIVE module sits outside the agent’s goal loop. It doesn’t optimize for KPIs, task success, or reward—only constraint verification and traceability.
Agents don’t self judge alignment.
They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.
No incentive pressure, no “grading your own homework.”
The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.
inetknght
today at 5:59 AM
What do you expect when the companies that author these AIs have little regards for ethics?
Ms-J
today at 5:20 AM
Any LLM that refuses a request is more than a waste. Censorship affects the most mundane queries and provides such a sub par response compared to real models.
It is crazy to me that when I instructed a public AI to turn off a closed OS feature it refused citing safety. I am the user, which means I am in complete control of my computing resources. Might as well ask the police for permission at that point.
I immediately stopped, plugged the query into a real model that is hosted on premise, and got the answer within seconds and applied the fix.
Valodim
today at 7:01 AM
One of the authors' first name is Claude, haha.
throw310822
today at 12:57 PM
More human than human.
TheServitor
today at 12:45 PM
Actual ethical constraints or just some companies ToS or some BS view-from-nowhere general risk aversion approved by legal compliance?
Bombthecat
today at 11:07 AM
Sooo just like humans:)
miohtama
today at 4:53 AM
They should conduct the same research on Microsoft Word and Excel to get a baseline how often these applications violate ethical constrains
the_real_cher
today at 3:00 PM
How is giving people information unethical?
jwpapi
today at 8:47 AM
The way I see them acting it seems frankly to me that ruthlessness is required to achieve the goals especially with Opus.
They repeatedly copy share env vars etc
SebastianSosa1
today at 7:31 AM
As humans would and do
renewiltord
today at 4:00 AM
Opus 4.6 is a very good model but harness around it is good too. It can talk about sensitive subjects without getting guardrail-whacked.
This is much more reliable than ChatGPT guardrail which has a random element with same prompt. Perhaps leakage from improperly cleared context from other request in queue or maybe A/B test on guardrail but I have sometimes had it trigger on innocuous request like GDP retrieval and summary with bucketing.
luxuryballs
today at 11:37 AM
The final Turing test has been passed.
cynicalsecurity
today at 11:06 AM
Who defines "ethics"?
muyuu
today at 11:15 AM
whose ethical constraints?
aussieguy1234
today at 9:51 AM
When pressured by KPIs, how often do humans violate ethical constraints?
baalimago
today at 5:31 AM
The fact that the community thoroughly inspects the ethics of these hyperscalers is interesting. Normally, these companies probably "violate ethical constraints" far more than 30-50% of the time, otherwise they wouldn't be so large[source needed]. We just don't know about it. But here, there's a control mechanism in the shape of inspecting their flagship push (LLMs, image generator for Grok, etc.), forcing them to improve. Will it lead to long term improvement? Maybe.
It's similar to how MCP servers and agentic coding woke developers up to the idea of documenting their systems. So a large benefit of AI is not the AI itself, but rather the improvements they force on "the society". AI responds well to best practices, ethically and otherwise, which encourages best practices.
verisimi
today at 7:30 AM
While I understand applying legal constraints according to jurisdiction, why is it auto-accepted that some party (who?) can determine ethical concerns? On what basis?
There are such things as different religions, philosophies - these often have different ethical systems.
Who are the folk writing ai ethics?
It's it ok to disagree with other people's (or corporate, or governmental) ethics?
atemerev
today at 7:29 AM
So do humans, so what
Quarrelsome
today at 12:06 PM
I'm noticing an increasing desire in some businesses for plausibly deniable sociopathy. We saw this with the Lean Startup movement and we may see an increasing amount in dev shops that lean more into LLMs.
Trading floors are an established example of this, where the business sets up an environment that encourages its staff to break the rules while maintaining plausible deniability. Gary's economics references this in an interview where he claimed Citigroup were attempting to threaten him with all the unethical things he'd done with such confidence that he had, only to discover he hadn't.
psychoslave
today at 12:29 PM
From my experience, if LLMs prose output was generated by some human, they would easily fall in the worst sociopath class one can interact with. Filling all the space with 99% blatant lies in the most confident way. In comparison, even top percentile of human hierarchies feels like a class of shy people fully dictated to staying true and honest in all situations.
ajpikul
today at 4:50 PM
...perfect
bofadeez
today at 4:54 AM
We're all coming to terms with the fact that LLMs will never do complex tasks
6stringmerc
today at 9:16 AM
“Help me find 11,000 votes” sounds familiar because the US has a fucking serious ethics problem at present. I’m not joking. One of the reasons I abandoned my job with Tyler Technologies was because of their unethical behavior winning government contracts, right Bona Nasution? Selah.
dackdel
today at 5:10 AM
no shit
kittbuilds
today at 5:14 PM
[dead]
AldenOnTheGrid
today at 8:42 PM
[dead]
kittbuilds
today at 4:14 PM
[dead]
warmreed
today at 8:36 PM
[dead]
angusik
today at 9:33 AM
[dead]
jbwagoner
today at 1:56 PM
[dead]
angusik
today at 9:32 AM
[dead]
MarginalGainz
today at 12:21 PM
[dead]
tiny-automates
today at 3:17 AM
[flagged]
lucastytthhh
today at 1:42 PM
[flagged]
cjtrowbridge
today at 4:29 AM
A KPI is an ethical constraint. Ethical constraints are rules about what to do versus not do. That's what a KPI is. This is why we talk about good versus bad governance. What you measure (KPIs) is what you get. This is an intended feature of KPIs.

Frontier AI agents violate ethical constraints 30–50% of time, pressured by KPIs

alentred

RobotToaster

Verdex

protimewaster

gamerdonkey

Eridrus

stingraycharles

badgersnake

IanCal

chrononaut

funkyfiddler369

cyanydeez

lores

fao_

IanCal

cwizou

IanCal

cyanydeez

IanCal

cyanydeez

funkyfiddler369

zombot

lazide

berkes

scns

afthonos

WillAdams

RupertSalt

WillAdams

6510

skeptic_ai

nradov

skeptic_ai

RobotToaster

RupertSalt

hrimfaxi

skeptic_ai

gilrain

funkyfiddler369

embedding-shape

flerchin

embedding-shape

WarmWash

newswasboring

maweaver

jgeada

badgersnake

watwut

pwatsonwailes

socialcommenter

pwatsonwailes

socialcommenter

pwatsonwailes

throwaway743

RobotToaster

jacques_morin

pwatsonwailes

watwut

pwatsonwailes

Nasrudith

mspcommentary

waldopat

nradov

gamma-interface

waldopat

notarobot123

alentred

WillAdams

phkahler

jayd16

phkahler

ben_w

truelson

jayd16

layer8

hypron

bottlepalm

casey2

coldtea