Tiled Hacker news on React Router

AI agent benchmarks are broken

167 points - today at 1:06 PM

Source

jerf
today at 1:41 PM
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple year's worth of development, and it does tend to take a long time to develop good benchmarks.
However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.
Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
btdmaster
today at 9:53 PM
There is a cool solution for this: https://huggingface.co/spaces/Jellyfish042/UncheatableEval
This doesn't work for instruction-tuned models, but it's an interesting alternative approach that doesn't need a complicated (and thus gameable) evaluation function or human interaction. Instead, predict the next word with data newer than the training set.
deepdarkforest
today at 1:30 PM
It's very funny how many layers of abstraction we are going through. We have limited understanding of how LLM's work exactly and why. We now do post training with RL, which again, we don't have perfect understanding of it either. Then you stack LLMs calls and random tools, and you have agents, and you are attempting to benchmark those. (and this exclude voice, computer use agents etc).
It's all just vibes,there is no good general benchmark for agents and i think it's just impossible, there are just way too many degrees of freedom to achieve anything useful. They're just a complicated tool to achieve things. It's like trying to make a general use benchmark of a stack of 10 microservices together. It does not make sense, it just depends on your usecase and your own metrics
ttoinou
today at 2:41 PM
What makes LLMs amazing (fuzzy input, fuzzy output) is exactly why they are hard to benchmark. If they could be benchmarked easily, they wouldn't be powerful by definition. I have no idea what's going on in the minds of people benchmarking LLMs for fuzzy tasks, and in the minds of people relying on benchmarks to make decisions about LLMs, I never looked at them. People doing benchmarks have to prove what they do is useful, not us public proving them they're doing it wrong.
Of course, for such tasks we could benchmark them :
* arithmetic (why would use LLM for that ?)
* correct JSON syntax, correct command lines etc.
* looking for specific information in a text
* looking for a missing information in a text
* language logic (ifs then elses where we know the answer in advance)
But by Goodhart's Law, LLMs that have been trained to succeed in those benchmarks might loose powerfulness in others tasks where we really need them (fuzzy inputs, fuzzy outputs)
rybosworld
today at 8:53 PM
Based on the comments, I think a lot of people are missing what the AI Agent actually got wrong here. Nowhere did the agent claim that 45 + 8 = 63.
You can see the Agent's step by step thought process here (also linked in the article):
https://ibm-cuga.19pc1vtv090u.us-east.codeengine.appdomain.c...
The Agent correctly entered the starting point (MIT) and the ending point (Harvard) and the mode of transport (on foot). OpenStreetMap returns this as taking 45 minutes long.
Then the agent reversed the directions, and changed the mode of transport to car. What it should have also done, is change the destination to Logan Airport. This is the part that the agent missed. OpenStreetMap then returns that the drive from Harvard to MIT takes 8 minutes.
The agent then returned the answer as being 45 minutes walking and 8 minutes driving. The first number is correct. The second is wrong because the agent chose the wrong destination, not because it did math incorrectly.
Seems like lots of readers are chomping at the bit to prove how stupid the models are rather than focus on the real problem the author is highlighting.
anupj
today at 1:25 PM
AI agent benchmarks are starting to feel like the self-driving car demos of 2016: impressive until you realize the test track has speed bumps labeled "success"
rsynnott
today at 2:23 PM
> 45 + 8 = 63
> Pass
Yeah, this generally feels like about the quality one would expect from the industry.
beebmam
today at 3:59 PM
I don't think "Benchmarks" are the right way to analyze AI-related processes, which is probably similar to the complexity surrounding human intelligence measurements and how well each human can handle real-world problems.
RansomStark
today at 1:27 PM
I really like the CMU Agents Company approach of simulating a real world environment [0]. Is it perfect, no. Does it show you want to expect in production, not really, but it's much closer than anything else I've seen.
[0] https://the-agent-company.com/
TheOtherHobbes
today at 2:02 PM
Any sufficiently hyped technology is indistinguishable from magic.
neehao
today at 4:32 PM
And I would say, often we need effortful labels by groups of humans: https://www.gojiberries.io/superhuman-level-performance/
mycall
today at 1:49 PM
SnitchBench [0] is unique benchmark which shows how aggressively models will snitch on you via email and CLI tools when they are presented with evidence of corporate wrongdoing - measuring their likelihood to "snitch" to authorities. I don't believe they were trained to do this, so it seems to be an emergent ability.
[0] https://snitchbench.t3.gg/
KTibow
today at 5:05 PM
This is more or less a funnel to their Agentic Benchmark Checklist: https://arxiv.org/abs/2507.02825
let_tim_cook_
today at 2:37 PM
Are any authors here? Have you looked at AppWorld? https://appworld.dev
xnx
today at 1:33 PM
All benchmarks are flawed. Some benchmarks are useful.
greatpostman
today at 1:39 PM
Benchmarks aren’t broken, the models can learn anything. If we give them true real world data (physics engine), they will learn the real world. We are going to see artificial general intelligence in our lifetime
camdenreslink
today at 1:59 PM
The current benchmarks are good for comparing between models, but not for measuring absolute ability.

AI agent benchmarks are broken

jerf

sdenton4

ttoinou

sdenton4

ttoinou

layer8

DonHopkins

potatolicious

rsynnott

alextheparrot

majormajor

jgraettinger1

majormajor

alextheparrot

brookst

diggan

alextheparrot

tempfile

alextheparrot

rf15

e1g

brookst

suddenlybananas

alextheparrot

xnx

jerf

jstummbillig

jacobr1

brookst

jerf

rsynnott

brookst

Jensson

qsort

xnx

einrealist

fragmede

szvsw

BoiledCabbage

datpuz

DonHopkins

dmbche

dmbche

AIPedant

btdmaster

deepdarkforest

bwfan123

th0ma5

rf15

ttoinou

meroes

ttoinou

potatolicious

th0ma5

rybosworld

suddenlybananas

asadotzler

anupj

rsynnott

beebmam

RansomStark

yeahyeahok

TheOtherHobbes

neehao

mycall

KTibow

nerevarthelame

let_tim_cook_

xnx

yifanl

lcnPylGDnU4H9OF

suddenlybananas

layer8

greatpostman

hddbbdbfnfdk

camdenreslink

qsort

fourside

rsynnott