Imustaskforhelp
yesterday at 10:00 AM
I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)
And this is something which has reached the public eye in one of the most anticipated videos basically. So I find it a bit rough as to think that OpenAI has the best practices for data, and if the public can be shown these inaccurate graphs themselves on based on benchmarks. I find it a bit harder to trust the benchmarks themselves and if OpenAI wants legitimate benchmarks.
Also I find it wild that after 1 month of this, nobody talked about it. I remember thinking that this is gonna be the highlight for a long time that a mega billion dollar company did such basic graph errors. I feel like we are all forgetting a lot of things as our news cycle keeps on moving faster.
(Another tangential point is about the OpenAI/Google employees who had signed the pledge yet nothing came out of it and this is something more recent & I also remember one of your comments on Hackernews.)
> I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case. [1]
This is a bit off-topic so sorry about that, but I hope that you realize that you did say you will go out on a limb with public comment so please don't mind if I ask for some questions, everyone supported you then and heck, even I thought that maybe I was wrong and I thought that I should trust you more than my gut-instincts because you clearly must know so much more than me/us but that aged like fine milk.
I would really love some answers or your thoughts now on that off-topic thought as well if possible as these are just some questions which are unanswered by you and I would love to have a respectful discussion about it, sorry for catching you off guard, waiting for your reply and I wish you to have a nice day ted.
[0]: https://www.reddit.com/r/BetterOffline/comments/1mk6ofz/gpt5...
[1]: https://news.ycombinator.com/item?id=47191196
tedsanders
yesterday at 7:28 PM
> I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)
Yeah, I found that slide very embarrassing. It wasn't intentionally inaccurate or misleading - just a design error made right before we went live. All the numbers on that slide were correct, and there was no problem in terms of research accuracy or data handling or reward hacking. A single bar height had the wrong value, set to its neighbor. Back then, we in the research team would generate data and graphs, and then hand them off to a separate design team, who remade the graphs in our brand style. After the GPT-5 launch with multiple embarrassingly bad graphs, I wrote an internal library so that researchers could generate graphs in our brand style directly, without the handoff. Since then our graphs have been much better.
I don't think it's unfair to assume our sloppiness in graphs translates to sloppiness in eval results. But they are different groups of people working on different timelines, so I hope it's at least plausible that our numbers are pretty honest, even if our design process occasionally results in sloppy graphs.
Regarding the DoW deal, I don't want to comment too publicly. I also can't say anything with confidence, as I wasn't part of the deal in any way shape or form. My perception from what I have read and heard is that both Anthropic and OpenAI have good intentions, both have loosened their prior policies over time to allow usage by the US military, and both have red lines to prohibit abuse by the US military. One place they differ is in the mechanisms employed to enforce those red lines (e.g. usage policies vs refusals vs human oversight). Each company asserts their methods are stronger than the other's, so I think we have to make our own judgments there. Accounts from the parties involved in the negotiations also conflict, so I don't think anyone's account can be trusted 100%. With that caveat, I thought this article on the DoW's POV was interesting (seems to support the notion that the breakdown wasn't over differing red lines, especially since they almost managed to salvage the deal): https://www.piratewires.com/p/inside-pentagon-anthropic-deal...
Lastly, I hope it's obvious to everyone that Anthropic is not at all a supply chain risk and the threats there were incredibly disappointing. I support them 100% and I'm glad to see them unhurt by the empty threats.
curioussquirrel
yesterday at 8:47 PM
Thank you for the transparency and insights! Very helpful.
We actually did the same thing re generating charts in brand style to avoid any mishaps, since then I sleep much better
This is what makes HN great: We get to hear from the people and not (only) the media dept. Thanks for your honesty and openness. I trust OpenAI a lot more when I hear balanced accounts like this.