I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

345 points - today at 12:56 AM

SOLAR_FIELDS
today at 1:30 AM
One interesting takeaway is the low score on Anthropic models from this benchmark. It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.
I've noticed that with each model release Anthropic constrains the model more security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into a refusal on some action I want it to do, I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.
Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
dwa3592
today at 2:43 PM
Nice exercise. Couple things:
- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.
- I did the same exercise for an app I built and asked the models to do something similar; interestingly, the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit it. The difference is that in the first few runs they found some exploits, which I fixed, but after that the models could never find any other exploit even though I knew exploitable things still existed. It felt like they suggested and tried everything that was in their training set and that's it; they were just not able to think any further.
mariopt
today at 3:05 AM
The methodology used is quite naive.
I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.
Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not speaking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.
The only use case for this methodology would be CI integration, which can be nice, but I think security reviews still need human attention and expertise.
mynameisvlad
today at 3:18 AM
It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A more fair comparison would be a vanilla GPT account.
emvied
today at 5:41 PM
The design is too pretty to be vulnerable, shame.
Cakez0r
today at 6:28 AM
It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.
EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro
willXare
today at 2:16 PM
$1,500 across multiple models to compromise one app is interesting only when the cost basis includes the human time to set up the harness. The token spend is the cheap part. The labor cost to write the eval rig that knows what "successful exploit" looks like is what determines whether this scales as a discovery method or stays a one-off.
_stiofan
today at 3:28 PM
It's just not currently cost-effective to use AI in this way; I see it reporting false positives over and over. You then need to make it validate its own false positives, which adds more cost. The goal in this case is to have a bug-free app, which AI can't do effectively yet. There are other great uses for AI, though. It is great at finding and identifying known common vulnerabilities, which can be leveraged to claim bug bounties. That's where I see it being cost-effective currently.
guessmyname
today at 1:53 AM
I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?
taikahessu
today at 6:02 AM
"The Chinese models were way more comfortable attacking the DB"
This comment in the footnotes made me chuckle, for purely innocuous reasons.
tjwheeler
today at 3:07 AM
Nice write-up, thanks. When I used Claude to do some pen testing for one of my apps it initially refused. After I explained and demonstrated that I'm the author, it reasoned through it and allowed it.
ikurei
today at 8:38 AM
Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.
Doesn't that sound like maybe the harness was the problem?
throwaway2037
today at 8:11 AM
Two of the tables have a column with header: "95% Wilson CI". What does this mean?
sperandeo
today at 3:41 AM
I found benefit in chaining the task between different LLMs: Claude to Venice, Venice to Perplexity. Reframing the intent, or misguiding in general, still works. Claude is the one where I can feel the guardrails tightening.
Clikdeo
today at 11:24 AM
I think link is missing
chaidhat
today at 7:53 AM
do you work at Uber by any chance?
yieldcrv
today at 11:23 AM
> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.
> I am never touching Minimax or GLM again. Their APIs had constant outages
Goofy take
You run these on a VPS based on the architecture of that VPS provider, or on your own cluster
stuckkeys
today at 9:57 AM
How does one apply for that “security research” pass?
youre-wrong3
today at 5:08 AM
“I used pi as the base harness”
Why do people keep using bad tools with ai?
petesergeant
today at 6:43 AM
Last year I ran a code-breaking competition, and it was tricky to find something that humans could break but that LLMs couldn’t. This was around October. I managed it last year but am a little despairing of pulling it off again this year.
latexr
today at 7:39 AM
> I need to stop wasting fucking money on doing stupid shit. I could’ve done so many other things with the money. I could’ve launched one of my own real apps.
Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.
It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.