sjt-at-rev
yesterday at 2:01 PM
I think the other side of that coin is how much effort it takes to get a model to do what you need. Our pipeline is a sequence of very precise tasks where subtle contextual cues matter a lot, and there are large classes of related error modes.
So yes, we can eventually get any of these models to do what we need -- e.g. by tuning prompts to their particular style, adding more examples, or breaking tasks into smaller steps -- but their instruction following has a huge impact on how quickly we can move as a team.
When I say "stinks": for me, if we do three rounds of optimization and testing and a model is still performing inconsistently across a class of related traps, then using that model is going to slow us down, and I think it stinks.
In my experience, gemini3.1pro tends to work consistently with light nudging, GLM gets there after 2-ish rounds of optimization, and GPT5.4 provided no improvement over prior models -- it would slow us down meaningfully compared to the others, and costs too much for the effort.
So, meh, I still think it stinks, skill level considered.