
AI agent benchmarks are broken

167 points - today at 1:06 PM

Source
  • jerf

    today at 1:41 PM

    When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple years' worth of development, and it does tend to take a long time to develop good benchmarks.

    However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se; using a judge of the same architecture as the thing being judged maximizes the probability that the benchmark fails to be valid, because the judge has the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

    Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.

      • sdenton4

        today at 2:15 PM

        When I was working in audio compression, evaluation was very painful because we had no programmatic way to measure how good some reconstructed audio sounds to a human. Any metric you could come up with was gameable, and direct optimization would lead to artifacts.

        As a result, we always had a two-step evaluation process. We would use a suite of metrics to guide development progress (validation), but the final evaluation reported in a paper always involved subjective human listening experiments. This was expensive, but the only way to show that the codecs were actually improving.

        Similarly, here it seems fine to use LLMs to judge your work in progress, but we should be requiring human evaluation for 'final' results.
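
        A minimal sketch of that two-stage pattern, with placeholder metrics and scores rather than the actual codec tooling: cheap automatic metrics drive the day-to-day validation loop, and the expensive subjective test is reserved for the headline result.

          import statistics

          def proxy_metric(reference, reconstruction):
              # Cheap automatic score used only to guide development (placeholder: MSE).
              return statistics.mean((r - x) ** 2 for r, x in zip(reference, reconstruction))

          def final_evaluation(listener_scores):
              # Expensive subjective listening test, reported as the headline result.
              return statistics.mean(listener_scores)

          dev_score = proxy_metric([0.1, 0.5, -0.2], [0.1, 0.4, -0.1])  # validation loop
          paper_score = final_evaluation([78, 82, 75, 80])              # e.g. MUSHRA-style 0-100 ratings
          print(dev_score, paper_score)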

          • ttoinou

            today at 2:42 PM

            Wouldn't that process prevent you from finding a subjectively better audio codec that doesn't improve the typical metrics (PSNR etc.)? An alternative process would be to first construct a metric that tries to approximate the subjective experience of humans, then use it to create audio codecs that optimize this metric.

              • sdenton4

                today at 5:09 PM

                There's two answers to that....

                The first is, how do you know the subjective optimization you're making is actually any good? You're just moving the problem back one layer of abstraction.

                The second is, we did that, eventually, by training models to predict subjective listening scores from the giant pile of subjective test data we had collected over the years (ViSQoL). It's great, but we still don't trust it for end-of-the-day, cross-codec comparison, because we don't want to reward overfitting to the trained model.

                https://arxiv.org/abs/2004.09584
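
                As a rough illustration of that general idea (not the actual ViSQoL pipeline), you can fit a regressor that maps automatically computed reference/degraded similarity features to collected subjective scores; the feature values and scores below are invented placeholders.

                  from sklearn.ensemble import GradientBoostingRegressor

                  # Placeholder similarity features between reference and degraded clips.
                  features = [[0.91, 0.05], [0.62, 0.30], [0.40, 0.55], [0.85, 0.10]]
                  mos = [4.5, 3.1, 2.0, 4.2]  # mean opinion scores from listening tests

                  model = GradientBoostingRegressor().fit(features, mos)
                  print(model.predict([[0.75, 0.20]]))  # predicted subjective score for a new clip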

                  • ttoinou

                    today at 5:45 PM

                    Nice

                    Well yeah you would still need human testing

                • layer8

                  today at 3:39 PM

                  You are describing psychoacoustic models, which work to a reasonable extent for lossy compression of audio (MP3 and successors are based on them), but I can see how it would be much more difficult/less helpful for reconstructing audio.

              • DonHopkins

                today at 3:58 PM

                You gotta snag yourself one of those awesome KEMAR dummy head and torso simulators, preferably the fully accessorized luxury edition that comes with the heavy duty portable travel case with lots of room for extra ears and microphones and wigs, which is so much fun to take through airport security.

                They were great for taking to Grateful Dead concerts to record the music directly in front of the Wall of Sound, and to measure the response so you can play back all your Dead tapes with that same front row psychoacoustic perspective. ;)

                https://www.grasacoustics.com/industries/kemar/applications-...

                https://www.grasacoustics.com/products/accessories/product/4...

            • potatolicious

              today at 1:52 PM

              > "I'm particularly annoyed by using LLMs to evaluate the output of LLMs."

              +1, and IMO part of a general trend where we're just not serious about making sure this shit works. Higher scores make stonks go up, who cares if it actually leads to reliably working products.

              But also, more importantly, it's starting to expose the fact that we haven't solved one of ML's core challenges: data collection and curation. On the training side we have obviated this somewhat (by ingesting the whole internet, for example), but on the eval side it feels like we're increasingly just going "actually constructing rigorous evaluation data, especially at this scale, would be very expensive... so let's not".

              I was at a local tech meetup recently where a recruiting firm was proudly showing off the LLM-based system they're using to screen candidates. They... did not evaluate the end-to-end efficacy of their system. At all. This seems like a theme within our industry - we're deploying these systems based purely on vibes without any real quantification of efficacy.

              Or in this case, we're quantifying efficacy... poorly.

                • rsynnott

                  today at 3:11 PM

                  > +1, and IMO part of a general trend where we're just not serious about making sure this shit works.

                  I suspect quite a lot of the industry is actively _opposed_ to that, because it could be damaging for the "this changes everything" narrative.

              • alextheparrot

                today at 2:14 PM

                LLMs evaluating LLM outputs really isn’t that dire…

                Discriminating good answers is easier than generating them. Good evaluations include test sets for the discriminators to show when this is or isn't true. Evaluating the outputs as the user would see them is more representative than having your generator do multiple tasks (e.g. solve a math query and format the output as a multiple choice answer).

                Also, human labels are good but have problems of their own, it isn’t like by using a “different intelligence architecture” we elide all the possible errors. Good instructions to the evaluation model often translate directly to better human results, showing a correlation between these two sources of sampling intelligence.
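
                A minimal sketch of what "test sets for the discriminators" can look like in practice, under simple assumptions (a judge callable plus a handful of human-labeled verdicts; nothing here is any particular product's API): measure the judge's agreement with humans before trusting it.

                  def judge_accuracy(judge, labeled_examples):
                      # labeled_examples: (question, answer, human_verdict) triples.
                      hits = sum(
                          judge(question, answer) == human_verdict
                          for question, answer, human_verdict in labeled_examples
                      )
                      return hits / len(labeled_examples)

                  # Only rely on the LLM judge where its agreement with human labels is high,
                  # e.g. fall back to human review if judge_accuracy(...) < 0.95.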

                  • majormajor

                    today at 3:56 PM

                    > Discriminating good answers is easier than generating them.

                    I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)

                    In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos, formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.

                    So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?

                    And what about when the top two pages of Google results start turning into model-generated blogspam?

                    If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.

                    A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is incredibly incentivized (intentionally or not!) to train specifically on this answer without necessarily actually getting better at the fundamental steps in the task.

                    IMO you can never use an AI agent benchmark that is published on the internet more than once.

                      • jgraettinger1

                        today at 5:09 PM

                        > You can't do that for LLM output.

                        That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test?

                        In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the under-test LLM made good decisions given the facts at hand.

                          • majormajor

                            today at 6:19 PM

                            I would if I was the developer, but if I'm the user being sold the product, or a third-party benchmarker, I don't think I'd have full access to that if most of that is happening in the vendor's internal services.

                        • alextheparrot

                          today at 8:20 PM

                          > Good evaluations write test sets for the discriminators to show when this is or isn’t true.

                          If they can’t write an evaluation for the discriminator I agree. All the input data issues you highlight also apply to generators.

                          • brookst

                            today at 7:34 PM

                            > IMO you can never use an AI agent benchmark that is published on the internet more than once.

                            This is a long-solved problem far predating AI.

                            You do it by releasing 90% of the benchmark publicly and holding back 10% for yourself or closely trusted partners.

                            Then benchmark performance can be independently evaluated to determine if performance on the 10% holdback matches the 90% public.
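
                            A sketch of that holdback scheme under simple assumptions (a flat list of test cases and a score per split); the interesting signal is a model that does much better on the public split than on the private one.

                              import random

                              def split_benchmark(cases, holdback=0.1, seed=0):
                                  shuffled = cases[:]
                                  random.Random(seed).shuffle(shuffled)
                                  cut = int(len(shuffled) * (1 - holdback))
                                  return shuffled[:cut], shuffled[cut:]  # public, private

                              def looks_overfit(public_score, private_score, tolerance=0.05):
                                  # A big gap suggests the published 90% leaked into training.
                                  return public_score - private_score > tolerance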

                        • diggan

                          today at 5:05 PM

                          > Discriminating good answers is easier than generating them.

                          Lots of other good replies to this specific part, but also: lots of developers struggle with the feeling that reviewing code is harder than writing code (something I'm personally not sure I agree with). I've seen that sentiment shared here on HN a lot, and it goes directly against that particular idea.

                            • alextheparrot

                              today at 8:16 PM

                              I wish this and the other replies would engage with the sentence right after it, which says you should test this premise empirically.

                          • tempfile

                            today at 4:09 PM

                            > Discriminating good answers is easier than generating them.

                            This is actually very wrong. Consider, for instance, that the people who grade your tests in school are typically more talented, capable, and trained than the people taking the test. This is true even when an answer key exists.

                            > Also, human labels are good but have problems of their own,

                            Granted, but...

                            > it isn’t like by using a “different intelligence architecture” we elide all the possible errors

                            nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.

                              • alextheparrot

                                today at 8:25 PM

                                It isn’t actually very wrong. Your example is tangential as graders in school have multiple roles — teaching the content and grading. That’s an implementation detail, not a counter to the premise.

                                I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.

                                • rf15

                                  today at 7:35 PM

                                  Trading control for convenience has always been the tradeoff in the recent AI hype cycle and the reason why so many people like to use ChatGPT.

                              • e1g

                                today at 3:12 PM

                                Agree, current "thinking" models are effectively "re-run this question N times, and determine the best answer", and this LLM-evaluating-LLM loop demonstrably leads to higher quality answers against objective metrics (in math, etc).

                                  • brookst

                                    today at 7:35 PM

                                    That’s… not how thinking models work. They tend to be iterative and serial, not parallel and then pick-one.

                                • suddenlybananas

                                  today at 2:33 PM

                                  What's 45+8? Is it 63?

                                      • alextheparrot

                                        today at 3:37 PM

                                        If this sort of error isn’t acceptable, it should be part of an evaluation set for your discriminator

                                        Fundamentally I'm not disagreeing with the article, but I also think most people who care take the above approach, because if you do care you read samples, find the issues, and patch them so you can hill-climb better.

                                • xnx

                                  today at 3:21 PM

                                  > I'm particularly annoyed by using LLMs to evaluate the output of LLMs

                                  This does seem a little crazy on its face, but it is yielding useful and improving tools.

                                    • jerf

                                      today at 3:47 PM

                                      It's not about it being crazy, and it's not about personal opinions about AI. It's about chaos mathematics. Iterating with the same system like that has certain easy-to-understand failure states. It's why I phrased it specifically in terms of using the same architecture to validate itself. If we had two radically different AI architectures that were capable of evaluating each other, firing them at each other for evaluation purposes would be much, much less susceptible to this sort of problem than firing either of them at themselves, which will never be a good idea.

                                      See also a cousin comment of mine observing that human brains are absolutely susceptible to the same effect. We're just so used to it that it is the water we swim through. (And arguably human brains are more diverse than current AI systems functioning at this level. No bet on how long that will be true for, though.)

                                      Such composite systems would still have their own characteristics and certainly wouldn't be guaranteed to be perfect or anything, but at least they would not tend to iteratively magnify their own individual flaws.

                                      Perhaps someday we will have such diverse architectures. We don't today have anything that can evaluate LLMs other than human brains, though.

                                  • jstummbillig

                                    today at 2:48 PM

                                    > using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test.

                                    That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?

                                      • jacobr1

                                        today at 4:04 PM

                                        The equivalent would be having the _same human_ review their own work. We require others with different experience and fresh eyes for secondary review, and for the most important tasks, multiple people.

                                        To some extent the same LLM with a new context history and a different prompt is sorta like that... but it is still much weaker than using a different system entirely.

                                          • brookst

                                            today at 7:46 PM

                                            How do you feel about o3 reviewing 4o-mini?

                                        • jerf

                                          today at 3:40 PM

                                          Yes, humans evaluating humans also causes human foibles to be magnified.

                                          I cite the entire current education system. Substantiating that claim would take more than an HN comment allows, though I think most people can probably get the drift of what I'm talking about, even if we'd disagree about the details. Absolutely humans are not immune to this.

                                          I also cite the entire concept of "fallacies", many of which are things that human brains both tend to produce and tend to evaluate poorly. An alien species might find some of our fallacies absolutely transparent, and have entirely different fallacies of their own that none of us would find convincing in the slightest, because of fundamentally different brain architectures.

                                          I don't think AIs are ready for this yet and I don't expect LLMs ever will be, but in the future getting an outsider perspective from them in a sort of Mixture of Experts architecture could be valuable for life decisions. (I look to the future AI architectures in which LLMs are just a component but not the whole.)

                                          • rsynnott

                                            today at 3:09 PM

                                            ... I mean, when evaluating "45 + 8 minutes" where the expected answer was "63 minutes", as in the article, a competent human reviewer does not go "hmm, yes, that seems plausible, it probably succeeded, give it the points".

                                            I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.

                                              • brookst

                                                today at 7:48 PM

                                                Have you ever hired human evaluators at scale? They make all sorts of mistakes. Relatively low probability, so it's a noise factor, but I have yet to meet the human who is 100% accurate at simple tasks done thousands of times.

                                                  • Jensson

                                                    today at 10:01 PM

                                                    Which is why you hire them at scale, as you say; then they are very reliable. LLMs at scale are not.

                                                    The problem with these AI models is there is no such point where you can just scale them up and they can solve problems as accurately as a group of humans. They add too much noise and eventually go haywire when left to their own devices.

                                            • qsort

                                              today at 3:02 PM

                                              We want machines that are better than humans, otherwise what purpose do they serve?

                                                • xnx

                                                  today at 3:20 PM

                                                  A machine with human level "AI" is still useful if it can run 24/7 and you can spin up 1M instances.

                                                    • einrealist

                                                      today at 6:11 PM

                                                      And boil the planet.

                                                      • fragmede

                                                        today at 7:42 PM

                                                        and they don't have family that gets sick or dies or come into work hungover or go off on political tangents and cause HR issues or want to take vacations or complain about bad working conditions.

                                            • szvsw

                                              today at 6:34 PM

                                              > I'm particularly annoyed by using LLMs to evaluate the output of LLMs.

                                              Even though I largely agree with parts of what you wrote, if you squint your eyes enough you can kind of see an argument along the lines of “difficult to solve but easy to verify.”

                                              • BoiledCabbage

                                                today at 2:28 PM

                                                > Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.

                                                There is a simple improvement here: give the agent a "do nothing" button. That way it at least needs to understand the task well enough to know it should press the do nothing button.

                                                Now a default agent that always presses it still shouldn't score 38%, but that's better than a NOP agent scoring 38%.
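
                                                A hedged sketch of how a harness could support that (the action names and task format are illustrative, not from any particular benchmark): make "do nothing" an explicit action and only credit it when the task actually calls for it.

                                                  ACTIONS = ["click", "type", "scroll", "do_nothing"]

                                                  def score(task, chosen_action):
                                                      # Credit "do nothing" only when it is the labeled correct action.
                                                      return 1.0 if chosen_action == task["correct_action"] else 0.0

                                                  print(score({"correct_action": "do_nothing"}, "do_nothing"))  # 1.0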

                                                • datpuz

                                                  today at 3:05 PM

                                                  Benchmarks in software have always been bullshit. AI benchmarks are just even more bullshit since they're trying to measure something significantly more subjective and nuanced than most.

                                                  • DonHopkins

                                                    today at 3:47 PM

                                                    It's like using steel to produce steel. What else are you going to use? Bamboo?

                                                      • dmbche

                                                        today at 3:54 PM

                                                        I'm not sure if I'm dense, but we don't use steel to make steel (whether crucibles or "feed material").

                                                        The first person to make steel made it without steel didn't they?

                                                        Did I miss something?

                                                        Edit0: fun tidbit - Wootz steel was made with crucibles of clay with rice husks mixed in (the husks would carbonize quickly and introduce air layers for better insulation), and many seemingly random objects (fruits, vegetation) were added to the crucible to control carbon content.

                                                        I highly recommend A Collection of Unmitigated Pedantry's series on steel (it's a blog, just search "ACOUP steel").

                                                          • dmbche

                                                            today at 7:48 PM

                                                            Second fun tidbit: bamboo was used as the fuel source in some furnaces - they did indeed use bamboo like the parent comment mentioned.

                                                        • AIPedant

                                                          today at 5:01 PM

                                                          It's more like using a faulty and dangerous automated foundry to make steel when you could just hire steelworkers.

                                                          That's the real problem here - these companies are swimming in money and have armies of humans working around the clock training LLMs, there is no honest reason to nickel-and-dime the actual evaluation of benchmarks. It's like OpenAI using exact text search to identify benchmark contamination for the GPT-4 technical report. I am quite certain they had more sophisticated tools available.

                                                  • btdmaster

                                                    today at 9:53 PM

                                                    There is a cool solution for this: https://huggingface.co/spaces/Jellyfish042/UncheatableEval

                                                    This doesn't work for instruction-tuned models, but it's an interesting alternative approach that doesn't need a complicated (and thus gameable) evaluation function or human interaction. Instead, predict the next word with data newer than the training set.
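
                                                      A rough sketch of that style of evaluation (next-token loss on text newer than the model's training cutoff); the model name and the fresh-text file are placeholders, and the loop is deliberately simplified.

                                                        import torch
                                                        from transformers import AutoModelForCausalLM, AutoTokenizer

                                                        tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
                                                        model = AutoModelForCausalLM.from_pretrained("gpt2")

                                                        fresh = open("articles_from_this_week.txt").read()     # assumed post-cutoff text
                                                        ids = tok(fresh, return_tensors="pt", truncation=True).input_ids
                                                        with torch.no_grad():
                                                            loss = model(ids, labels=ids).loss                 # mean next-token cross-entropy
                                                        print("perplexity:", torch.exp(loss).item())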

                                                    • deepdarkforest

                                                      today at 1:30 PM

                                                        It's very funny how many layers of abstraction we are going through. We have limited understanding of how LLMs work exactly and why. We now do post-training with RL, which, again, we don't have a perfect understanding of either. Then you stack LLM calls and random tools, and you have agents, and you are attempting to benchmark those (and this excludes voice, computer-use agents, etc.).

                                                        It's all just vibes; there is no good general benchmark for agents, and I think it's just impossible, because there are way too many degrees of freedom to achieve anything useful. They're just a complicated tool to achieve things. It's like trying to make a general-use benchmark for a stack of 10 microservices together. It does not make sense; it just depends on your use case and your own metrics.

                                                        • bwfan123

                                                          today at 1:36 PM

                                                          I can hear echoes of an earlier era.

                                                          There were Yahoo Pipes and web-services frameworks, which rhyme with MCP and agentic frameworks.

                                                            • th0ma5

                                                              today at 9:24 PM

                                                              Pipes and services in general are reliable but the issues were social and economic. Getting everyone to agree was seen as a great way to poach users and give up control, plus the usual problems with open world vs. closed world assumptions. Thanks for mentioning this!

                                                          • rf15

                                                            today at 7:42 PM

                                                            > We have limited understanding of how LLM's work exactly and why.

                                                            blatantly untrue, and as a concept only useful to those who want to sell AI as this "magical thing" that "just works"

                                                        • ttoinou

                                                          today at 2:41 PM

                                                          What makes LLMs amazing (fuzzy input, fuzzy output) is exactly why they are hard to benchmark. If they could be benchmarked easily, they wouldn't be powerful by definition. I have no idea what's going on in the minds of people benchmarking LLMs for fuzzy tasks, or in the minds of people relying on benchmarks to make decisions about LLMs; I never look at them. People doing benchmarks have to prove that what they do is useful; it's not on the public to prove they're doing it wrong.

                                                          Of course, there are tasks for which we could benchmark them (a sketch of one deterministic check follows below):

                                                          * arithmetic (why would you use an LLM for that?)

                                                          * correct JSON syntax, correct command lines, etc.

                                                          * looking for specific information in a text

                                                          * looking for missing information in a text

                                                          * language logic (ifs and elses where we know the answer in advance)

                                                          But by Goodhart's Law, LLMs that have been trained to succeed on those benchmarks might lose power on other tasks where we really need them (fuzzy inputs, fuzzy outputs).
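
                                                          As noted above, the JSON item needs no LLM judge at all; a deterministic check is enough (a minimal sketch, not tied to any particular benchmark).

                                                            import json

                                                            def is_valid_json(output: str) -> bool:
                                                                try:
                                                                    json.loads(output)
                                                                    return True
                                                                except json.JSONDecodeError:
                                                                    return False

                                                            print(is_valid_json('{"a": 1}'), is_valid_json('{a: 1}'))  # True False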

                                                            • meroes

                                                              today at 3:16 PM

                                                              > arithmetic (why would use LLM for that ?)

                                                              Because people ask LLMs all of these things, including arithmetic. People were saying the same about the number of r's in strawberry. Why ask an LLM that!?!? But the big AI companies want LLMs to be better at these questions, probably because people ask them. There is no other explanation for the money poured into RLHF'ing these types of problems.

                                                                • ttoinou

                                                                  today at 3:25 PM

                                                                  For me, that could only be solved by changing the architecture and/or introducing more internal tooling (like calling a program to do the computation). It doesn't make any sense to fine-tune a fuzzy-input, fuzzy-output natural language processing algorithm to add and multiply all combinations of six-digit numbers.

                                                                    • potatolicious

                                                                      today at 4:36 PM

                                                                      This feels like a philosophical fault line in the industry.

                                                                      For people whose purpose is to produce reliably working systems yeah, training a model that calls out to deterministic logic to do things like math makes total sense. It will pretty much always be more reliable than training a text generation model to produce correct arithmetic.

                                                                      But it feels like there's another side of the industry that's more concerned with... I dunno, metaphysical aspects of these models? Where the idea that the model is a stochastic ball that isn't conscious, isn't thinking, and does poorly at various tasks is anathema. So the effort continues to try and train and fine-tune these models until... something.

                                                                      It reminds me of the great Tesla-vs-everyone-else self-driving debates that raged over the past several years. Lots of people unhappy that the best-functioning systems fused many sensor types and a mixture of heuristic and machine-learned systems in a complex architecture. These folks insisted that the "best" architecture was an end-to-end machine-learned system based entirely on visible light cameras. Because it's "most human" or some other such nonsense. As far as I can tell there was never any merit to this position beyond some abstract notion of architectural purity.

                                                                      Same thing here I suppose.

                                                              • th0ma5

                                                                today at 9:27 PM

                                                                Since when do people like the fuzziness of outputs? I think you make an interesting point, but it also seems to imply that benchmarking will never truly be possible, which I think is true unless we can also make them observable, which, as you say, gives up the mystique.

                                                            • rybosworld

                                                              today at 8:53 PM

                                                              Based on the comments, I think a lot of people are missing what the AI Agent actually got wrong here. Nowhere did the agent claim that 45 + 8 = 63.

                                                              You can see the Agent's step by step thought process here (also linked in the article):

                                                              https://ibm-cuga.19pc1vtv090u.us-east.codeengine.appdomain.c...

                                                              The Agent correctly entered the starting point (MIT) and the ending point (Harvard) and the mode of transport (on foot). OpenStreetMap returns this as taking 45 minutes long.

                                                              Then the agent reversed the directions, and changed the mode of transport to car. What it should have also done, is change the destination to Logan Airport. This is the part that the agent missed. OpenStreetMap then returns that the drive from Harvard to MIT takes 8 minutes.

                                                              The agent then returned the answer as being 45 minutes walking and 8 minutes driving. The first number is correct. The second is wrong because the agent chose the wrong destination, not because it did math incorrectly.

                                                              Seems like lots of readers are chomping at the bit to prove how stupid the models are rather than focus on the real problem the author is highlighting.

                                                                • suddenlybananas

                                                                  today at 9:42 PM

                                                                  The model's scoring was done by another model though, no? That was the source of the answer being mislabeled as correct. So a different model thought that 45+8=63.

                                                                  • asadotzler

                                                                    today at 9:14 PM

                                                                    "champing"

                                                                • anupj

                                                                  today at 1:25 PM

                                                                  AI agent benchmarks are starting to feel like the self-driving car demos of 2016: impressive until you realize the test track has speed bumps labeled "success"

                                                                  • rsynnott

                                                                    today at 2:23 PM

                                                                    > 45 + 8 = 63

                                                                    > Pass

                                                                    Yeah, this generally feels like about the quality one would expect from the industry.

                                                                    • beebmam

                                                                      today at 3:59 PM

                                                                      I don't think "benchmarks" are the right way to analyze AI-related processes; the difficulty is probably similar to the complexity of measuring human intelligence and how well each human can handle real-world problems.

                                                                      • RansomStark

                                                                        today at 1:27 PM

                                                                        I really like the CMU Agent Company approach of simulating a real-world environment [0]. Is it perfect? No. Does it show you what to expect in production? Not really, but it's much closer than anything else I've seen.

                                                                        [0] https://the-agent-company.com/

                                                                          • yeahyeahok

                                                                            today at 5:36 PM

                                                                            Damn. Super bullish on CMU. Somehow, they seem routinely left out of the top CS schools discussion, at least in mainstream discourse: MIT, Stanford, Cal, .... I've seen a disproportionate amount of stellar research come from there. Also, interestingly, I have met really incompetent people from all the other top-3 schools but have yet to meet an incompetent CMU SCS alum -- wtf are they feeding them in Pittsburgh??

                                                                        • TheOtherHobbes

                                                                          today at 2:02 PM

                                                                          Any sufficiently hyped technology is indistinguishable from magic.

                                                                          • neehao

                                                                            today at 4:32 PM

                                                                            And I would say, often we need effortful labels by groups of humans: https://www.gojiberries.io/superhuman-level-performance/

                                                                            • mycall

                                                                              today at 1:49 PM

                                                                              SnitchBench [0] is a unique benchmark that shows how aggressively models will snitch on you via email and CLI tools when they are presented with evidence of corporate wrongdoing - measuring their likelihood to "snitch" to authorities. I don't believe they were trained to do this, so it seems to be an emergent ability.

                                                                              [0] https://snitchbench.t3.gg/

                                                                              • KTibow

                                                                                today at 5:05 PM

                                                                                This is more or less a funnel to their Agentic Benchmark Checklist: https://arxiv.org/abs/2507.02825

                                                                                  • nerevarthelame

                                                                                    today at 8:57 PM

                                                                                    Finally, a benchmark for benchmarks. And what's great is that they already benchmarked their benchmark benchmark.

                                                                                    (Apologies for the benchmark snark. I'm glad people are doing this research, thanks for sharing it.)

                                                                                • let_tim_cook_

                                                                                  today at 2:37 PM

                                                                                  Are any authors here? Have you looked at AppWorld? https://appworld.dev

                                                                                  • xnx

                                                                                    today at 1:33 PM

                                                                                    All benchmarks are flawed. Some benchmarks are useful.

                                                                                      • yifanl

                                                                                        today at 1:44 PM

                                                                                        Here's a third sentence fragment: These benchmarks are not.

                                                                                          • lcnPylGDnU4H9OF

                                                                                            today at 5:10 PM

                                                                                            Just want to nit: none of those are sentence fragments, they are complete thoughts with a subject and a predicate. Yours kinda comes close to being a fragment but it really just omits what "are not" (the predicate) is referring to, which is included in prior context.

                                                                                            For example, a fragment with a missing predicate.

                                                                                            • suddenlybananas

                                                                                              today at 2:02 PM

                                                                                              It's nearly a haiku!

                                                                                                • layer8

                                                                                                  today at 3:52 PM

                                                                                                    All benchmarks are flawed.
                                                                                                    Not all benchmarks are useless.
                                                                                                    But these benchmarks are.

                                                                                      • greatpostman

                                                                                        today at 1:39 PM

                                                                                        Benchmarks aren’t broken, the models can learn anything. If we give them true real world data (physics engine), they will learn the real world. We are going to see artificial general intelligence in our lifetime

                                                                                          • hddbbdbfnfdk

                                                                                            today at 5:03 PM

                                                                                            more like in the next two weeks methinks

                                                                                        • camdenreslink

                                                                                          today at 1:59 PM

                                                                                          The current benchmarks are good for comparing between models, but not for measuring absolute ability.

                                                                                            • qsort

                                                                                              today at 2:04 PM

                                                                                              Not even that, see LMArena. They vaguely gesture in the general direction of the model being good, but between contamination and issues with scoring they're little more than a vibe check.

                                                                                              • fourside

                                                                                                today at 2:35 PM

                                                                                                But if the test metrics are fundamentally flawed they might not be useful even for relative comparisons. Like if I told you that Model A scores 10x as many blorks points as model B, I don’t know how you translate that into insights about performance on real world scenarios.

                                                                                                • rsynnott

                                                                                                  today at 3:13 PM

                                                                                                  I don't really buy that they're even necessarily useful for comparing models. In the example from the article, if model A says "45 + 8 minutes" and gets marked correct, and model B says "63 minutes" (the correct answer) and gets marked correct, the test will say that they're equivalent on that axis, when in fact one gave a completely nonsensical answer.